CS236781: Deep Learning on Computational Accelerators¶

Homework Assignment 4¶

Faculty of Computer Science, Technion.

Submitted by:

# Name Id email
Student 1 Eden Dembinsky 212227888 edendem@campus.technion.ac.il
Student 2 Assaf Lovton 209844414 assaflovton@campus.technion.ac.il

Introduction¶

In this assignment we'll explore deep reinforcement learning. We'll implement two popular and related methods for directly learning the policy of an agent for playing a simple video game. Then we'll focus our attention on image generation and implement two different generative models: a variational autoencoder and a generative adversarial network.

General Guidelines¶

  • Please read the getting started page on the course website. It explains how to set up, run and submit the assignment.
  • This assignment requires running on GPU-enabled hardware. Please read the course servers usage guide. It explains how to use and run your code on the course servers to benefit from training with GPUs.
  • The text and code cells in these notebooks are intended to guide you through the assignment and help you verify your solutions. The notebooks do not need to be edited at all (unless explicitly specified). The only exception is to fill your name(s) in the above cell before submission. Please do not remove sections or change the order of any cells.
  • All your code (and even answers to questions) should be written in the files within the Python package corresponding to the assignment number (hw1, hw2, etc). You can of course use any editor or IDE to work on these files.

Contents¶

  • Part 1: Deep Reinforcement Learning
  • Part 2: Variational Autoencoder
  • Part 3: Generative Adversarial Networks
  • Part 4: Summary Questions
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\cset}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} \newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\ip}[3]{\left<#1,#2\right>_{#3}} \newcommand{\given}[]{\,\middle\vert\,} \newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)} \newcommand{\grad}[]{\nabla} $$

Part 1: Deep Reinforcement Learning¶

In the tutorial we have seen value-based reinforcement learning, in which we learn to approximate the action-value function $q(s,a)$.

In this exercise we'll explore a different approach, directly learning the agent's policy distribution, $\pi(a|s)$ by using policy gradients, in order to safely land on the moon!

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
In [2]:
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Prefer CPU, GPU won't help much in this assignment
print('Using device:', device)
device = 'cpu'
# Seed for deterministic tests
SEED = 42
Using device: cuda

Some technical notes before we begin:

  • This part does not require a GPU. We won't need large models, and the computation bottleneck will be the generation of episodes to train on.
  • In order to run this notebook on the server, you must prepend the xvfb-run command to create a virtual screen. For example,
    • To run this notebook with srun, do
        srun -c2 --gres=gpu:1 xvfb-run -a -s "-screen 0 1440x900x24" python main.py run-nb <filename>
    • To run the submission script, do
        srun -c2 xvfb-run -a -s "-screen 0 1440x900x24" python main.py prepare-submission ...
    • Note that we have already included the xvfb-run command inside the jupyter-lab.sh script, so you can use it as usual with srun.
  • The OpenAI gym library is not officially supported on Windows. It should still be possible to install and run the necessary environment for this exercise, but we cannot provide technical support for this. If you have trouble installing locally, we suggest running on the course server.
  • When running the gym environment locally (i.e. not on the course server), an interactive window should appear, showing you the gameplay. There's currently a known issue when running this through jupyter: the window may remain open and seem stuck after the episode completes. If this happens, it is OK: you can keep running the notebook and the rest of the cells won't be affected. The window will close properly when you shut down the kernel.

Policy gradients¶

Recall from the tutorial that we define the policy of an agent as the conditional distribution, $$ \pi(a|s) = \Pr(a_t=a\vert s_t=s), $$ which defines how likely the agent is to take action $a$ at state $s$.

Furthermore we define the action-value function, $$ q_{\pi}(s,a) = \E{g_t(\tau)|s_t = s,a_t=a,\pi} $$ where $$ g_t(\tau) = r_{t+1}+\gamma r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}, $$ is the total discounted reward of a specific trajectory $\tau$ from time $t$, and the expectation in $q$ is over all possible trajectories, $ \tau=\left\{ (s_0,a_0,r_1,s_1), \dots (s_T,a_T,r_{T+1},s_{T+1}) \right\}. $

In the tutorial we saw that we can learn a value function starting with some random function and updating it iteratively by using the Bellman optimality equation. Given that we have some action-value function, we can immediately create a policy based on that by simply selecting an action which maximizes the action-value at the current state, i.e. $$ \pi(a|s) = \begin{cases} 1, & a = \arg\max_{a'\in\cset{A}} q(s,a') \\ 0, & \text{else} \end{cases}. $$ This is called $q$-learning. This approach aims to obtain a policy indirectly through the action-value function. Yet, in most cases we don't actually care about knowing the value of particular states, since all we need is a good policy for our agent.
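
As a small illustration of the greedy policy above, the following sketch builds the one-hot distribution from a vector of action-values. The function name `greedy_policy` and the numbers are hypothetical and are not part of the course code.

```python
import numpy as np

def greedy_policy(q_values: np.ndarray) -> np.ndarray:
    """Return a one-hot policy that puts all probability on argmax_a q(s, a)."""
    policy = np.zeros_like(q_values, dtype=float)
    policy[np.argmax(q_values)] = 1.0
    return policy

# Hypothetical action-values for a single state with four actions.
q_s = np.array([0.1, 1.7, -0.3, 0.9])
pi_s = greedy_policy(q_s)
```

Here all of the probability mass lands on the second action, since it has the largest action-value.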

Here we'll take a different approach and learn a policy distribution $\pi(a|s)$ directly - by using policy gradients.

Formalism¶

We define a parametric policy, $\pi_\vec{\theta}(a|s)$, and maximize total discounted reward (or minimize the negative reward): $$ \mathcal{L}(\vec{\theta})=\E[\tau]{-g(\tau)|\pi_\vec{\theta}} = -\int g(\tau)p(\tau|\vec{\theta})d\tau, $$ where $p(\tau|\vec{\theta})$ is the probability of a specific trajectory $\tau$ under the policy defined by $\vec{\theta}$.

Since we want to find the parameters $\vec{\theta}$ which minimize $\mathcal{L}(\vec{\theta})$, we'll compute the gradient w.r.t. $\vec{\theta}$: $$ \grad\mathcal{L}(\vec{\theta}) = -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau. $$

Unfortunately, if we try to write $p(\tau|\vec{\theta})$ explicitly, we find that computing its gradient with respect to $\vec{\theta}$ is quite intractable due to a huge product of terms depending on $\vec{\theta}$: $$ p(\tau|\vec{\theta})=p\left(\left\{ (s_t,a_t,r_{t+1},s_{t+1})\right\}_{t\geq0}\given\vec{\theta}\right) =p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t). $$

However, by using the fact that $\grad_{x}\log(f(x))=\frac{\grad_{x}f(x)}{f(x)}$, we can convert the product into a sum: $$ \begin{align} \grad\mathcal{L}(\vec{\theta}) &= -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau = -\int g(\tau)\frac{\grad p(\tau|\vec{\theta})}{p(\tau|\vec{\theta})}p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left(p(\tau|\vec{\theta})\right)p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left( p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\left( \log p(s_0) + \sum_{t\geq0} \log \pi_{\vec{\theta}}(a_t|s_t) + \sum_{t\geq0}\log p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t) p(\tau|\vec{\theta})d\tau \\ &= \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. \end{align} $$
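
The identity can be checked numerically on a toy problem. The sketch below assumes a hypothetical one-step "trajectory" with three actions and fixed returns, and compares the direct gradient of $-\E{g}$ with the score-function form $\E{-g\,\grad\log\pi}$. The expectation is evaluated exactly by summing over actions rather than by sampling.

```python
import torch

# Hypothetical one-step problem: three actions with fixed returns g(a).
theta = torch.tensor([0.2, -0.5, 1.0], requires_grad=True)
returns = torch.tensor([1.0, 2.0, 3.0])

# Direct gradient of the loss L = -E[g] = -sum_a pi(a) g(a).
pi = torch.softmax(theta, dim=0)
loss = -(pi * returns).sum()
(direct_grad,) = torch.autograd.grad(loss, theta)

# Score-function form: E[-g(a) * grad log pi(a)], with the expectation
# evaluated exactly by summing over the three actions.
score_grad = torch.zeros_like(theta)
for a in range(3):
    log_pi_a = torch.log_softmax(theta, dim=0)[a]
    (grad_log_pi_a,) = torch.autograd.grad(log_pi_a, theta)
    score_grad += pi[a].detach() * (-returns[a]) * grad_log_pi_a
```

Both gradients agree up to floating-point error, which is the one-step version of the derivation above.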

This is the "vanilla" version of the policy gradient. We can interpret it as a weighted log-likelihood function. The log-policy is the log-likelihood term we wish to maximize and the total discounted reward acts as a weight: high-return positive trajectories will cause the probability of actions taken during them to increase, and negative-return trajectories will cause the probabilities of actions taken to decrease.

In the following figures we see three trajectories: high-return positive-reward (green), low-return positive-reward (yellow) and negative-return (red) and the action probabilities along the trajectories after the update. Credit: Sergey Levine.

The major drawback of the policy-gradient is its high variance, which causes erratic optimization behavior and therefore slow convergence. One reason for this is that the log-policy weight term, $g(\tau)$, can vary wildly between different trajectories, even if they're similar in actions. Later on we'll implement the loss and explore some methods of variance reduction.

Landing on the moon with policy gradients¶

In the spirit of the recent achievements of the Israeli space industry, we'll apply our reinforcement learning skills to solve a simple game called LunarLander.

This game is available as an environment in OpenAI gym.

In this environment, you need to control the lander and get it to land safely on the moon. To do so, you must apply bottom, right or left thrusters (each is either fully on or fully off) and get it to land within the designated zone as quickly as possible and with minimal wasted fuel.

In [3]:
import gym

# Just for fun :) ... but also to re-define the default max number of steps
ENV_NAME = 'Beresheet-v2'
MAX_EPISODE_STEPS = 300
if ENV_NAME not in gym.envs.registry.env_specs:
    gym.register(
        id=ENV_NAME,
        entry_point='gym.envs.box2d:LunarLander',
        max_episode_steps=MAX_EPISODE_STEPS,
        reward_threshold=200,
    )
In [4]:
import gym

env = gym.make(ENV_NAME)

print(env)
print(f'observations space: {env.observation_space}')
print(f'action space: {env.action_space}')

ENV_N_ACTIONS = env.action_space.n
ENV_N_OBSERVATIONS = env.observation_space.shape[0]
<TimeLimit<LunarLander<Beresheet-v2>>>
observations space: Box([-inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf], (8,), float32)
action space: Discrete(4)

The observation at each step consists of the lander's position, velocity, angle, angular velocity and ground contact state. The actions are no-op, fire left thruster, fire bottom thruster and fire right thruster.

You are highly encouraged to read the documentation in the source code of the LunarLander environment to understand the reward system, and see how the actions and observations are created.

Policy network and Agent¶

Let's start with our policy-model. This will be a simple neural net, which should take an observation and return a score for each possible action.

TODO:

  1. Implement all methods in the PolicyNet class in the hw4/rl_pg.py module. Start small. A simple MLP with a few hidden layers is a good starting point. You can come back and change it later based on the experiments.
    Notice that we'll use the build_for_env method to instantiate a PolicyNet based on the configuration of a given environment.
  2. If you need hyperparameters to configure your model (e.g. number of hidden layers, sizes, etc.), add them in part1_pg_hyperparams() in hw4/answers.py.
In [5]:
import hw4.rl_pg as hw4pg
import hw4.answers

hp = hw4.answers.part1_pg_hyperparams()

# You can add keyword-args to this function which will be populated from the
# hyperparameters dict.
p_net = hw4pg.PolicyNet.build_for_env(env, device, **hp)
p_net
Out[5]:
PolicyNet(
  (fc): Sequential(
    (0): Linear(in_features=8, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=4, bias=True)
  )
)
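
For orientation, a module with the same printed structure as the output above could look like the following sketch. The class name `TinyPolicyNet` and its constructor arguments are hypothetical; the real PolicyNet interface (including build_for_env) is defined in hw4/rl_pg.py and may differ.

```python
import torch
import torch.nn as nn

class TinyPolicyNet(nn.Module):
    """Hypothetical stand-in for PolicyNet: observation -> one score per action."""

    def __init__(self, in_features: int, out_actions: int, hidden: int = 128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Raw (unnormalized) action scores; softmax is applied by the caller.
        return self.fc(x)

net = TinyPolicyNet(in_features=8, out_actions=4)
scores = net(torch.zeros(5, 8))  # a batch of five dummy observations
```

The network returns scores rather than probabilities, which keeps the loss computation numerically stable when combined with a log-softmax.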

Now we need an agent. The purpose of our agent will be to act according to the current policy and generate experiences. Our PolicyAgent will use a PolicyNet as the current policy function.

We'll also define some extra datatypes to help us represent the data generated by our agent. You can find the Experience, Episode and TrainBatch datatypes in the hw4/rl_data.py module.

TODO: Implement the current_action_distribution() method of the PolicyAgent class in the hw4/rl_pg.py module.

In [6]:
for i in range(10):
    agent = hw4pg.PolicyAgent(env, p_net, device)
    d = agent.current_action_distribution()
    
    test.assertSequenceEqual(d.shape, (env.action_space.n,))
    test.assertAlmostEqual(d.sum(), 1.0, delta=1e-5)
    
print(d)
tensor([0.2422, 0.2707, 0.2406, 0.2465], grad_fn=<ViewBackward0>)
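
A distribution like the one printed above is typically obtained by applying a softmax to the network's action scores, after which an action can be sampled from it. The sketch below uses hypothetical scores and is only one possible way to do this, not necessarily how PolicyAgent is expected to be implemented.

```python
import torch

torch.manual_seed(0)

# Hypothetical raw scores for the four actions at the current state.
action_scores = torch.tensor([0.3, 0.5, 0.1, 0.2])

# Softmax turns scores into a valid probability distribution pi(a|s).
action_probs = torch.softmax(action_scores, dim=0)

# Sample one action index according to those probabilities.
action = torch.multinomial(action_probs, num_samples=1).item()
```

Sampling, rather than taking the argmax, is what lets the agent explore actions that currently have lower probability.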

TODO: Implement the step() method of the PolicyAgent.

In [7]:
agent = hw4pg.PolicyAgent(env, p_net, device)
exp = agent.step()

test.assertIsInstance(exp, hw4pg.Experience)
print(exp)
Experience(state=tensor([ 0.0067,  1.4130,  0.6834,  0.0905, -0.0078, -0.1548,  0.0000,  0.0000]), action=1, reward=0.5251912434224277, is_done=False)

To test our agent, we'll write some code that allows it to play an environment. We'll use the Monitor wrapper in gym to generate a video of the episode for visual debugging.

TODO: Complete the implementation of the monitor_episode() method of the PolicyAgent.

In [8]:
env, n_steps, reward = agent.monitor_episode(ENV_NAME, p_net, device=device)

To display the Monitor video in this notebook, we'll use a helper function from our jupyter_utils and a small wrapper that extracts the path of the last video file.

In [9]:
import cs236781.jupyter_utils as jupyter_utils

def show_monitor_video(monitor_env, idx=0, **kw):
    # Extract video path
    video_path = monitor_env.videos[idx][0]
    video_path = os.path.relpath(video_path, start=os.path.curdir)
    
    # Use helper function to embed the video
    return jupyter_utils.show_video_in_notebook(video_path, **kw)
In [10]:
print(f'Episode ran for {n_steps} steps. Total reward: {reward:.2f}')

show_monitor_video(env, idx=0)
Episode ran for 82 steps. Total reward: -141.00
Out[10]:

Training data¶

The next step is to create data to train on. We need to train on batches of state-action pairs, so that our network can learn to predict the actions.

We'll split this task into three parts:

  1. Generate a batch of Episodes, by using an Agent that's playing according to our current policy network. Each Episode object contains the Experience objects created by the agent.
  2. Calculate the total discounted reward for each state we encountered and action we took. This is our action-value estimate.
  3. Convert the Episodes into a batch of tensors to train on. Each batch will contain states, action taken per state, reward accrued, and the calculated q-value estimates. These will be stored in a TrainBatch object.

TODO: Complete the implementation of the episode_batch_generator() method in the TrainBatchDataset class within the hw4.rl_data module. This will address part 1 in the list above.

In [11]:
import hw4.rl_data as hw4data

def agent_fn():
    env = gym.make(ENV_NAME)
    hp = hw4.answers.part1_pg_hyperparams()
    p_net = hw4pg.PolicyNet.build_for_env(env, device, **hp)
    return hw4pg.PolicyAgent(env, p_net, device)
    
ds = hw4data.TrainBatchDataset(agent_fn, episode_batch_size=8, gamma=0.9)
batch_gen = ds.episode_batch_generator()
b = next(batch_gen)
print('First episode:', b[0])

test.assertEqual(len(b), 8)
for ep in b:
    test.assertIsInstance(ep, hw4data.Episode)
    
    # Check that it's a full episode
    is_done = [exp.is_done for exp in ep.experiences]
    test.assertFalse(any(is_done[0:-1]))
    test.assertTrue(is_done[-1])
First episode: Episode(total_reward=-210.69, #experences=95)

TODO: Complete the implementation of the calc_qvals() method in the Episode class. This will address part 2. These q-values are an estimate of the actual action value function: $$\hat{q}_{t} = \sum_{t'\geq t} \gamma^{t'-t}r_{t'+1}.$$
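
The sum above satisfies the recursion $\hat{q}_{t} = r_{t+1} + \gamma\hat{q}_{t+1}$, so all values can be computed in a single backward pass over the rewards. The following is a minimal sketch with a hypothetical function name; in it `rewards[t]` plays the role of $r_{t+1}$.

```python
from typing import List, Sequence

def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return q_hat[t] = sum_{t' >= t} gamma^(t'-t) * rewards[t'] for every t."""
    qvals = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        qvals[t] = running
    return qvals

example_qvals = discounted_returns([1.0, 2.0, 3.0], gamma=0.5)
# example_qvals == [2.75, 3.5, 3.0]
```

The backward pass costs one multiply-add per step, whereas evaluating the sum separately for each $t$ would be quadratic in the episode length.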

In [12]:
np.random.seed(SEED)
test_rewards = np.random.randint(-10, 10, 100)
test_experiences = [hw4pg.Experience(None,None,r,False) for r in test_rewards] 
test_episode = hw4data.Episode(np.sum(test_rewards), test_experiences)

qvals = test_episode.calc_qvals(0.9)
qvals = list(qvals)

expected_qvals = np.load(os.path.join('tests', 'assets', 'part1_expected_qvals.npy'))
for i in range(len(test_rewards)):
    test.assertAlmostEqual(expected_qvals[i], qvals[i], delta=1e-3)

TODO: Complete the implementation of the from_episodes() method in the TrainBatch class. This will address part 3.

Notes:

  • The TrainBatchDataset class provides a generator function that will use the above function to lazily generate batches of training samples and labels on demand.
  • This allows us to use a standard PyTorch dataloader to wrap our Dataset and provide us with parallel data loading for free! This means we can run multiple environments with multiple agents in separate background processes to generate data for training and thus prevent the data loading bottleneck which is caused by the fact that we must generate full Episodes to train on in order to calculate the q-values.
  • We'll set the DataLoader's batch_size to None because we have already implemented custom batching in our dataset.
  • You can choose the number of worker processes generating data using the num_workers parameter in the hyperparams dict. Set num_workers=0 to disable parallelization.
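
To illustrate the batch_size=None behaviour described in the notes above, here is a toy sketch of an iterable dataset that yields ready-made batches of varying length. `ToyBatchDataset` is hypothetical and only mimics the idea; it is not the course's TrainBatchDataset.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class ToyBatchDataset(IterableDataset):
    """Hypothetical dataset that yields whole, variable-length batches."""

    def __init__(self, n_batches: int):
        super().__init__()
        self.n_batches = n_batches

    def __iter__(self):
        for i in range(self.n_batches):
            # Each item is already a batch; its length differs between items,
            # just as episode batches contain different numbers of steps.
            yield torch.full((i + 2, 8), float(i))

# batch_size=None disables automatic batching, so items pass through as-is.
loader = DataLoader(ToyBatchDataset(3), batch_size=None, num_workers=0)
batch_shapes = [tuple(batch.shape) for batch in loader]
```

With automatic batching enabled, the loader would instead try to stack these tensors and fail because their first dimensions differ.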
In [13]:
from torch.utils.data import DataLoader

hp = hw4.answers.part1_pg_hyperparams()

ds = hw4data.TrainBatchDataset(agent_fn, episode_batch_size=8, gamma=0.9)
dl = DataLoader(
    ds,
    batch_size=None,
    num_workers=hp['num_workers'],
    multiprocessing_context='fork' if hp['num_workers'] > 0 else None
)


for i, train_batch in enumerate(dl):
    states, actions, qvals, reward_mean = train_batch
    print(f'#{i}: {train_batch}', end="\n\n")
    test.assertEqual(states.shape[0], actions.shape[0])
    test.assertEqual(qvals.shape[0], actions.shape[0])
    test.assertEqual(states.shape[1], env.observation_space.shape[0])
    if i > 1:
        break
#0: TrainBatch(states: torch.Size([714, 8]), actions: torch.Size([714]), q_vals: torch.Size([714])), num_episodes: 8)

#1: TrainBatch(states: torch.Size([625, 8]), actions: torch.Size([625]), q_vals: torch.Size([625])), num_episodes: 8)

#2: TrainBatch(states: torch.Size([712, 8]), actions: torch.Size([712]), q_vals: torch.Size([712])), num_episodes: 8)

Loss functions¶

As usual, we need a loss function to optimize over. We'll calculate three types of losses:

  1. The causal vanilla policy gradient loss.
  2. The policy gradient loss, with a baseline to reduce variance.
  3. An entropy-based loss whose purpose is to diversify the agent's action selection, and prevent it from being "too sure" about its actions. This loss will be used together with one of the above losses.

Causal vanilla policy-gradient¶

We have derived the policy-gradient as $$ \grad\mathcal{L}(\vec{\theta}) = \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$

By writing the discounted reward explicitly and enforcing causality, i.e. the action taken at time $t$ can't affect the reward at time $t'<t$, we can get a slightly lower-variance version of the policy gradient:

$$ \grad\mathcal{L}_{\text{PG}}(\vec{\theta}) = \E[\tau]{-\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'-t}r_{t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$

In practice, the expectation over trajectories is calculated using a Monte-Carlo approach, i.e. simply sampling $N$ trajectories and averaging the term inside the expectation. Therefore, we will use the following estimated version of the policy gradient:

$$ \begin{align} \hat\grad\mathcal{L}_{\text{PG}}(\vec{\theta}) &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'-t}r_{i,t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}) \\ &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \hat{q}_{i,t} \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). \end{align} $$

Note the use of the notation $\hat{q}_{i,t}$ to represent the estimated action-value at time $t$ in the sampled trajectory $i$. Here $\hat{q}_{i,t}$ is acting as the weight-term for the policy gradient.
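
In code, a scalar loss whose gradient matches this estimate (up to normalization) can be formed by weighting the log-probability of each taken action by $\hat{q}_{i,t}$. The sketch below is a hedged illustration: the function name is hypothetical, and averaging over all steps in the batch (rather than dividing by the number of trajectories $N$) is an assumption that may not match what the notebook's tests expect.

```python
import torch

def policy_gradient_loss(action_scores, actions, qvals):
    """-mean_t( q_hat_t * log pi(a_t | s_t) ) over all steps in the batch."""
    log_probs = torch.log_softmax(action_scores, dim=1)
    # Pick the log-probability of the action that was actually taken.
    taken_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(qvals * taken_log_probs).mean()

# Two steps, two actions, uniform scores: log pi = log(0.5) for every action.
scores = torch.zeros(2, 2)
actions = torch.tensor([0, 1])
qvals = torch.tensor([1.0, 3.0])
loss = policy_gradient_loss(scores, actions, qvals)
```

For this toy input the loss equals $-2\log(0.5) = 2\log 2 \approx 1.386$, since both steps share the same log-probability and the weights average to 2.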

TODO: Complete the implementation of the VanillaPolicyGradientLoss class in the hw4/rl_pg.py module.

In [14]:
# Ensure deterministic run
env = gym.make(ENV_NAME)
env.seed(SEED)
torch.manual_seed(SEED)

def agent_fn():
    # Use a simple "network" here, so that this test doesn't depend on
    # your specific PolicyNet implementation
    p_net_test = nn.Linear(ENV_N_OBSERVATIONS, ENV_N_ACTIONS, bias=True)
    agent = hw4pg.PolicyAgent(env, p_net_test)
    return agent

dataloader = hw4data.TrainBatchDataset(agent_fn, gamma=0.9, episode_batch_size=4)

test_batch = next(iter(dataloader))
test_action_scores = torch.randn(len(test_batch), env.action_space.n)
print(f"{test_batch=}", end='\n\n')
print(f"test_action_scores=\n{test_action_scores}\nshape={test_action_scores.shape}", end='\n\n')

loss_fn_p = hw4pg.VanillaPolicyGradientLoss()
loss_p, _ = loss_fn_p(test_batch, test_action_scores)

print(f'{loss_p=}')
test.assertAlmostEqual(loss_p.item(), -48.560, delta=1e-2)
test_batch=TrainBatch(states: torch.Size([375, 8]), actions: torch.Size([375]), q_vals: torch.Size([375])), num_episodes: 4)

test_action_scores=
tensor([[ 0.8932,  0.4749,  0.8569, -0.7365],
        [-0.7853,  1.0901, -0.0665,  1.2573],
        [ 0.0867, -1.2705, -0.1987, -0.4103],
        ...,
        [-0.7778, -2.4352,  0.1117,  0.9482],
        [-1.4593, -0.0609, -0.1148,  1.5804],
        [ 1.2975, -0.3326, -1.0626,  0.3869]])
shape=torch.Size([375, 4])

loss_p=tensor(-48.5605)

Policy-gradient with baseline¶

Another way to reduce the variance of our gradient is to use relative weighting of the log-policy instead of absolute reward values. $$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$ In other words, we don't measure a trajectory's worth by its total reward, but by how much better that total reward is relative to some expected ("baseline") reward value, denoted above by $b$. Note that subtracting a baseline has no effect on the expected value of the policy gradient. It's easy to prove this directly by definition.
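
A sketch of that argument, assuming $b$ is a constant that does not depend on the trajectory, and using $\grad\log p(\tau|\vec{\theta})=\sum_{t\geq0}\grad\log\pi_{\vec{\theta}}(a_t|s_t)$ from the derivation of the vanilla policy gradient:

$$ \E[\tau]{b\sum_{t\geq0}\grad\log \pi_{\vec{\theta}}(a_t|s_t)} = b\int \grad\log\left(p(\tau|\vec{\theta})\right)p(\tau|\vec{\theta})d\tau = b\int \grad p(\tau|\vec{\theta})d\tau = b\,\grad\int p(\tau|\vec{\theta})d\tau = b\,\grad 1 = 0. $$

The baseline term therefore contributes nothing in expectation, although it can change the variance of the sampled estimate.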

Here we'll implement a very simple baseline (not optimal in terms of variance reduction): the average of the estimated action-values $\hat{q}_{i,t}$.
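
A minimal sketch of this mean baseline follows. The function name and the choice to average over all steps are assumptions made for illustration, and the real BaselinePolicyGradientLoss interface may differ.

```python
import torch

def baseline_policy_gradient_loss(action_scores, actions, qvals):
    """Policy-gradient loss weighted by (q_hat - b), with b = mean(q_hat)."""
    baseline = qvals.mean()
    log_probs = torch.log_softmax(action_scores, dim=1)
    taken_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -((qvals - baseline) * taken_log_probs).mean()
    return loss, baseline

scores = torch.zeros(2, 2)   # uniform policy, so log pi = log(0.5) everywhere
actions = torch.tensor([0, 1])
qvals = torch.tensor([1.0, 3.0])
loss, baseline = baseline_policy_gradient_loss(scores, actions, qvals)
```

With the baseline equal to 2, the two steps receive weights of -1 and +1, so the below-average action is pushed down and the above-average one is pushed up. Under the uniform policy used here the two terms cancel and the loss is zero.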

TODO: Complete the implementation of the BaselinePolicyGradientLoss class in the hw4/rl_pg.py module.

In [15]:
# Using the same batch and action_scores from above cell
loss_fn_p = hw4pg.BaselinePolicyGradientLoss()
loss_p, loss_dict = loss_fn_p(test_batch, test_action_scores)

print(f'{loss_dict=}')
test.assertAlmostEqual(loss_dict['baseline'], -29.841, delta=1e-2)
test.assertAlmostEqual(loss_p.item(), 1.297, delta=1e-2)
loss_dict={'loss_p': 1.2976999282836914, 'baseline': -29.84125328063965}

Entropy loss¶

The entropy of a probability distribution (in our case the policy), is $$ H(\pi) = -\sum_{a} \pi(a|s)\log\pi(a|s). $$ The entropy is always non-negative and attains its maximum for a uniform distribution. We'll use the entropy of the policy as a bonus, i.e. we'll try to maximize it. The idea is to prevent the policy distribution from becoming too narrow and thus promote the agent's exploration.

First, we'll calculate the maximal possible entropy value of the action distribution for a set number of possible actions. This will be used as a normalization term.
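
For a uniform distribution over $n$ actions the entropy is $\log n$, which is the maximal value. The sketch below computes both the maximum and the entropy of a policy given by action scores. The function names are hypothetical, and it does not include the normalization or sign conventions used by ActionEntropyLoss.

```python
import math
import torch

def max_entropy(n_actions: int) -> float:
    """Entropy of the uniform distribution over n_actions: log(n_actions)."""
    return math.log(n_actions)

def policy_entropy(action_scores: torch.Tensor) -> torch.Tensor:
    """H(pi) = -sum_a pi(a|s) log pi(a|s), computed per row of scores."""
    log_probs = torch.log_softmax(action_scores, dim=1)
    return -(log_probs.exp() * log_probs).sum(dim=1)

uniform_h = policy_entropy(torch.zeros(1, 4))  # uniform policy
peaked_h = policy_entropy(torch.tensor([[10.0, 0.0, 0.0, 0.0]]))  # near-deterministic
```

The uniform policy reaches the maximum of $\log 4$, while the peaked policy has entropy close to zero.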

TODO: Complete the implementation of the calc_max_entropy() method in the ActionEntropyLoss class.

In [16]:
loss_fn_e = hw4pg.ActionEntropyLoss(env.action_space.n)
print('max_entropy = ', loss_fn_e.max_entropy)

test.assertAlmostEqual(loss_fn_e.max_entropy, 1.38629436, delta=1e-3)
max_entropy =  1.3862943611198906

TODO: Complete the implementation of the forward() method in the ActionEntropyLoss class.

In [17]:
loss_e, _ = loss_fn_e(test_batch, test_action_scores)
print('loss = ', loss_e)

test.assertAlmostEqual(loss_e.item(), -0.8103, delta=1e-2)
loss =  tensor(-0.8106)

Training¶

We'll implement our training procedure as follows:

  1. Initialize the current policy to be a random policy.
  2. Sample $N$ trajectories from the environment using the current policy.
  3. Calculate the estimated $q$-values, $\hat{q}_{i,t} = \sum_{t'\geq t} \gamma^{t'-t}r_{i,t'+1}$ for each trajectory $i$.
  4. Calculate policy gradient estimate $\hat\grad\mathcal{L}(\vec{\theta})$ as defined above.
  5. Perform SGD update $\vec{\theta}\leftarrow\vec{\theta}-\eta\hat\grad\mathcal{L}(\vec{\theta})$.
  6. Repeat from step 2.

This is known as the REINFORCE algorithm.
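
The steps above can be sketched end to end on a toy problem. The example below assumes a hypothetical two-armed bandit (one-step episodes, so $\hat{q}$ is just the immediate reward) and a policy parameterized directly by two logits. It is meant only to show the shape of the loop, not the PolicyTrainer implementation.

```python
import torch

torch.manual_seed(0)

# Hypothetical two-armed bandit: action 1 always pays more than action 0.
arm_rewards = torch.tensor([0.0, 1.0])
theta = torch.zeros(2, requires_grad=True)      # step 1: uniform initial policy
optimizer = torch.optim.SGD([theta], lr=0.5)

for _ in range(200):
    # Step 2: sample a batch of one-step "trajectories" from the current policy.
    probs = torch.softmax(theta, dim=0).detach()
    actions = torch.multinomial(probs, num_samples=16, replacement=True)
    # Step 3: with one-step episodes the q-value estimate is the reward itself.
    qvals = arm_rewards[actions]
    # Step 4: loss whose gradient is the policy-gradient estimate.
    log_probs = torch.log_softmax(theta, dim=0)[actions]
    loss = -(qvals * log_probs).mean()
    # Step 5: SGD update of the policy parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_probs = torch.softmax(theta, dim=0).detach()
```

After training, most of the probability mass should sit on the better-paying action.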

Fortunately, we've already implemented everything we need for steps 1-4 so we need only a bit more code to put it all together.

The following block implements a wrapper, train_pg, which creates all the objects we need in order to train our policy gradient model.

In [18]:
import hw4.answers
from functools import partial

ENV_NAME = "Beresheet-v2"

def agent_fn_train(agent_type, p_net, seed, envs_dict):
    winfo = torch.utils.data.get_worker_info()
    wid = winfo.id if winfo else 0
    seed = seed + wid if seed else wid

    env = gym.make(ENV_NAME)
    envs_dict[wid] = env
    env.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    return agent_type(env, p_net)

def train_rl(agent_type, net_type, loss_fns, hp, seed=None, checkpoints_file=None, **train_kw):
    print(f'hyperparams: {hp}')
    
    envs = {}
    p_net = net_type(ENV_N_OBSERVATIONS, ENV_N_ACTIONS, **hp)
    p_net.share_memory()
    agent_fn = partial(agent_fn_train, agent_type, p_net, seed, envs)
    
    dataset = hw4data.TrainBatchDataset(agent_fn, hp['batch_size'], hp['gamma'])
    dataloader = DataLoader(
        dataset, batch_size=None,
        num_workers=hp['num_workers'],
        multiprocessing_context='fork' if hp['num_workers'] > 0 else None
    )
    optimizer = optim.Adam(p_net.parameters(), lr=hp['learn_rate'], eps=hp['eps'])
    
    trainer = hw4pg.PolicyTrainer(p_net, optimizer, loss_fns, dataloader, checkpoints_file)
    try:
        trainer.train(**train_kw)
    except KeyboardInterrupt as e:
        print('Training interrupted by user.')
    finally:
        for env in envs.values():
            env.close()

    # Include final model state
    training_data = trainer.training_data
    training_data['model_state'] = p_net.state_dict()
    return training_data
    
def train_pg(baseline=False, entropy=False, **train_kwargs):
    hp = hw4.answers.part1_pg_hyperparams()
    
    loss_fns = []
    if baseline:
        loss_fns.append(hw4pg.BaselinePolicyGradientLoss())
    else:
        loss_fns.append(hw4pg.VanillaPolicyGradientLoss())
    if entropy:
        loss_fns.append(hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta']))

    return train_rl(hw4pg.PolicyAgent, hw4pg.PolicyNet, loss_fns, hp, **train_kwargs)

The PolicyTrainer class implements the training loop, collects the losses and rewards and provides some useful checkpointing functionality. The training loop will generate batches of episodes and train on them until either:

  • The average total reward from the last running_mean_len episodes is greater than the target_reward, OR
  • The number of generated episodes reached max_episodes.
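
The two stopping conditions above can be expressed with a fixed-length window over recent episode rewards. The following sketch is hypothetical: the function name is made up, and requiring a full window before comparing against the target is an assumption that PolicyTrainer may not share.

```python
from collections import deque

def first_stop_episode(episode_rewards, target_reward, running_mean_len, max_episodes):
    """Return (stopped, n_episodes) for the first episode where training would stop."""
    window = deque(maxlen=running_mean_len)
    for n_episodes, reward in enumerate(episode_rewards, start=1):
        window.append(reward)
        running_mean = sum(window) / len(window)
        # Condition 1: running mean over a full window exceeds the target.
        if len(window) == running_mean_len and running_mean > target_reward:
            return True, n_episodes
        # Condition 2: the episode budget is exhausted.
        if n_episodes >= max_episodes:
            return True, n_episodes
    return False, len(episode_rewards)

stop_info = first_stop_episode(
    [-200.0, -150.0, -100.0, -50.0],
    target_reward=-120.0, running_mean_len=2, max_episodes=10,
)
```

In this example the running mean of the last two episodes first exceeds -120 after the fourth episode, where it is -75.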

Most of this class is already implemented for you.

TODO:

  1. Complete the training loop by implementing the train_batch() method of the PolicyTrainer.
  2. Tweak the hyperparameters in the part1_pg_hyperparams() function within the hw4/answers.py module as needed. You get some sane defaults.

Let's check whether our model is actually training. We'll try to reach a very low (bad) target reward, just as a sanity check to see that training works. Your model should be able to reach this target reward within a few batches.

You can increase the target reward and use this block to manually tweak your model and hyperparameters a few times.

In [19]:
target_reward = -140 # VERY LOW target
train_data = train_pg(target_reward=target_reward, seed=SEED, max_episodes=2000, running_mean_len=10)

test.assertGreater(train_data['mean_reward'][-1], target_reward)
hyperparams: {'batch_size': 30, 'gamma': 0.985, 'beta': 0.4, 'learn_rate': 0.02, 'eps': 1e-08, 'num_workers': 0, 'hl': [128], 'b': True}
=== Training...
#9: step=00024312, loss_p=-59.89, m_reward(10)=-126.8 (best=-178.5):  15%|█▌        | 300/2000 [00:14<01:24, 20.02it/s] 

=== 🚀 SOLVED - Target reward reached! 🚀

Experimenting with different losses¶

We'll now run a few experiments to see the effect of different loss functions on the training dynamics. Namely, we'll try:

  1. Vanilla PG (vpg): No baseline, no entropy
  2. Baseline PG (bpg): Baseline, no entropy loss
  3. Entropy PG (epg): No baseline, with entropy loss
  4. Combined PG (cpg): Baseline, with entropy loss
In [20]:
from collections import namedtuple
from pprint import pprint
import itertools as it


ExpConfig = namedtuple('ExpConfig', ('name','baseline','entropy'))

def exp_configs():
    exp_names = ('vpg', 'epg', 'bpg', 'cpg')
    z = zip(exp_names, it.product((False, True), (False, True)))
    return (ExpConfig(n, b, e) for (n, (b, e)) in z)

pprint(list(exp_configs()))
[ExpConfig(name='vpg', baseline=False, entropy=False),
 ExpConfig(name='epg', baseline=False, entropy=True),
 ExpConfig(name='bpg', baseline=True, entropy=False),
 ExpConfig(name='cpg', baseline=True, entropy=True)]

We'll save the training data from each experiment for plotting.

In [21]:
import pickle

def dump_training_data(data, filename):
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, mode='wb') as file:
        pickle.dump(data, file)
        
def load_training_data(filename):
    with open(filename, mode='rb') as file:
        return pickle.load(file)

Let's run the experiments! We'll run each configuration for a fixed number of episodes so that we can compare them.

Notes:

  1. Until your models start working, you can decrease the number of episodes for each experiment, or only run one experiment.
  2. The results will be saved in a file. To re-run the experiments, you can set force_run to True.
In [22]:
import math

exp_max_episodes = 4000

results = {}
training_data_filename = os.path.join('results', f'part1_exp.dat')

# Set to True to force re-run (careful! will delete old experiment results)
force_run = False

# Skip running if results file exists.
if os.path.isfile(training_data_filename) and not force_run:
    print(f'=== results file {training_data_filename} exists, skipping experiments.')
    results = load_training_data(training_data_filename)
    
else:
    for n, b, e in exp_configs():
        print(f'=== Experiment {n}')
        results[n] = train_pg(baseline=b, entropy=e, max_episodes=exp_max_episodes, post_batch_fn=None)
    dump_training_data(results, training_data_filename)
=== results file results/part1_exp.dat exists, skipping experiments.
In [23]:
def plot_experiment_results(results, fig=None):
    if fig is None:
        fig, _ = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(18,12))
    for i, plot_type in enumerate(('loss_p', 'baseline', 'loss_e', 'mean_reward')):
        ax = fig.axes[i]
        for exp_name, exp_res in results.items():
            if plot_type not in exp_res:
                continue
            ax.plot(exp_res['episode_num'], exp_res[plot_type], label=exp_name)
        ax.set_title(plot_type)
        ax.set_xlabel('episode')
        ax.legend()
    return fig
    
experiments_results_fig = plot_experiment_results(results)

You should see positive training dynamics in the graphs (reward going up). If you don't, use them to further update your model or hyperparams.

To pass the test, you'll need to get a best total mean reward of at least 10 within the fixed number of episodes using the combined loss. It's possible to get much higher (over 100).

In [24]:
best_cpg_mean_reward = max(results['cpg']['mean_reward'])
print(f'Best CPG mean reward: {best_cpg_mean_reward:.2f}')

test.assertGreater(best_cpg_mean_reward, 10)
Best CPG mean reward: 247.78

Now let's take a look at a gameplay video of our cpg model after the short training!

In [25]:
hp = hw4.answers.part1_pg_hyperparams()
p_net_cpg = hw4pg.PolicyNet.build_for_env(env, **hp)
p_net_cpg.load_state_dict(results['cpg']['model_state'])

env, n_steps, reward = hw4pg.PolicyAgent.monitor_episode(ENV_NAME, p_net_cpg)
print(f'{n_steps} steps, total reward: {reward:.2f}')
show_monitor_video(env)
76 steps, total reward: -54.57
Out[25]:

Advantage Actor-Critic (AAC)¶

We have seen that the policy-gradient loss can be interpreted as a log-likelihood of the policy term (selecting a specific action at a specific state), weighted by the future rewards of that choice of action.

However, naïvely weighting by rewards has significant drawbacks in terms of the variance of the resulting gradient. We addressed this by adding a simple baseline term which represented our "expected reward" so that we increase probability of actions leading to trajectories which exceed this expectation and vice-versa.

In this part we'll explore a more powerful baseline, which is the idea behind the AAC method.

The advantage function¶

Recall the definition of the state-value function $v_{\pi}(s)$ and action-value function $q_{\pi}(s,a)$:

$$ \begin{align} v_{\pi}(s) &= \E{g(\tau)|s_0 = s,\pi} \\ q_{\pi}(s,a) &= \E{g(\tau)|s_0 = s,a_0=a,\pi}. \end{align} $$

Both these functions represent the value of the state $s$. However, $v_\pi$ averages over the first action according to the policy, while $q_\pi$ fixes the first action and then continues according to the policy.

Their difference is known as the advantage function: $$ a_\pi(s,a) = q_\pi(s,a)-v_\pi(s). $$

If $a_\pi(s,a)>0$ it means that it's better (in expectation) to take action $a$ in state $s$ compared to the average action. In other words, $a_\pi(s,a)$ represents the advantage of using action $a$ in state $s$ compared to the others.
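As a small numeric illustration of these definitions, the sketch below uses a single state with three actions. The policy probabilities and action values are made-up numbers, not taken from the assignment. The state value is the policy-weighted average of the action values, and the advantage measures each action against that average.

```python
# Hypothetical single state with three actions; all numbers are illustrative.
policy = [0.2, 0.5, 0.3]      # pi(a|s), sums to 1
q_values = [1.0, 4.0, -2.0]   # q_pi(s, a) for each action

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
v = sum(p * q for p, q in zip(policy, q_values))

# a_pi(s, a) = q_pi(s, a) - v_pi(s)
advantages = [q - v for q in q_values]

# Under the policy, the expected advantage is zero.
expected_advantage = sum(p * a for p, a in zip(policy, advantages))
```

Only the second action has a positive advantage here, so it is the only one whose probability a policy-gradient step would push up.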

So far we have used an estimate for $q_\pi$ as our weighting term for the log-policy, with a fixed baseline per batch.

$$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$

Now, we will use the state value as a baseline, so that an estimate of the advantage function is our weighting term:

$$ \hat\grad\mathcal{L}_{\text{AAC}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-v_\pi(s_t)\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$

Intuitively, using the advantage function makes sense because it means we're weighting our policy's actions according to how advantageous they are compared to other possible actions.

But how will we know $v_\pi(s)$? We'll learn it of course, using another neural network. This is known as actor-critic learning. We simultaneously learn the policy (actor) and the value of states (critic). We'll treat it as a regression task: given a state $s_t$, our state-value network will output $\hat{v}_\pi(s_t)$, an estimate of the actual unknown state-value. Our regression targets will be the discounted rewards, $\hat{q}_{i,t}$ (see question 2), and we can use a simple MSE as the loss function, $$ \mathcal{L}_{\text{SV}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0}\left(\hat{v}_\pi(s_t) - \hat{q}_{i,t}\right)^2. $$
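The quantities in this paragraph can be sketched for one short episode in plain Python. The rewards, discount factor and critic outputs below are illustrative assumptions, and `discounted_returns` is a hypothetical helper, not part of the `hw4` package.

```python
def discounted_returns(rewards, gamma):
    """Return q_hat[t] = sum_{k>=0} gamma^k * rewards[t+k] for one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each step reuses the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode: rewards and critic outputs are illustrative numbers.
rewards = [1.0, 0.0, 2.0]
gamma = 0.5
v_hat = [1.0, 1.5, 1.0]   # critic's estimates for s_0, s_1, s_2

q_hat = discounted_returns(rewards, gamma)
# Advantage estimates: the weights applied to grad log pi in the AAC loss.
advantages = [q - v for q, v in zip(q_hat, v_hat)]
# Squared-error term of the state-value loss for this single episode.
loss_sv = sum((v - q) ** 2 for v, q in zip(v_hat, q_hat))
```

In an autograd framework the advantage would typically be treated as a constant when computing the policy term, so that the policy loss does not push gradients into the critic's output.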

Implementation¶

We'll build heavily on our implementation of the regular policy-gradient method, and just add a new model class and a new loss class, with a small modification to the agent.

Let's start with the model. It will accept a state, and return action scores (as before), but also the value of that state. You can experiment with a dual-head network that has a shared base, or implement two separate parts within the network.

TODO:

  1. Implement the model as the AACPolicyNet class in the hw4/rl_ac.py module.
  2. Set the hyperparameters in the part1_aac_hyperparams() function of the hw4.answers module.
In [26]:
import hw4.rl_ac as hw4ac

hp = hw4.answers.part1_aac_hyperparams()
pv_net = hw4ac.AACPolicyNet.build_for_env(env, device, **hp)
pv_net
Out[26]:
AACPolicyNet(
  (value_layer): Linear(in_features=128, out_features=1, bias=True)
  (base): Sequential(
    (0): Linear(in_features=8, out_features=128, bias=True)
    (1): ReLU()
  )
  (action_layer): Linear(in_features=128, out_features=4, bias=True)
)

TODO: Complete the implementation of the agent class, AACPolicyAgent, in the hw4/rl_ac.py module.

In [27]:
agent = hw4ac.AACPolicyAgent(env, pv_net, device)
exp = agent.step()

test.assertIsInstance(exp, hw4pg.Experience)
print(exp)
Experience(state=tensor([ 0.0046,  1.4165,  0.4643,  0.2485, -0.0053, -0.1052,  0.0000,  0.0000]), action=2, reward=-2.9471781088557973, is_done=False)

TODO: Implement the AAC loss function as the class AACPolicyGradientLoss in the hw4/rl_ac.py module.

In [28]:
loss_fn_aac = hw4ac.AACPolicyGradientLoss(delta=1.)
test_state_values = torch.ones(test_action_scores.shape[0], 1)
loss_t, loss_dict = loss_fn_aac(test_batch, (test_action_scores, test_state_values))

print(f'{loss_dict=}')
test.assertAlmostEqual(loss_dict['adv_m'], -30.841, delta=1e-2)
test.assertAlmostEqual(loss_t.item(), 1466.830, delta=1e-2)
loss_dict={'loss_v': 1517.0616455078125, 'adv_m': -30.84125328063965, 'loss_p': -50.23125076293945}

Experimentation¶

Let's run the same experiment as before, but with the AAC method and compare the results.

In [29]:
def train_aac(baseline=False, entropy=False, **train_kwargs):
    hp = hw4.answers.part1_aac_hyperparams()
    loss_fns = [hw4ac.AACPolicyGradientLoss(hp['delta']), hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta'])]
    return train_rl(hw4ac.AACPolicyAgent, hw4ac.AACPolicyNet, loss_fns, hp, **train_kwargs)
In [30]:
training_data_filename = os.path.join('results', f'part1_exp_aac.dat')

# Set to True to force re-run (careful, will delete old experiment results)
force_run = False

if os.path.isfile(training_data_filename) and not force_run:
    print(f'=== results file {training_data_filename} exists, skipping experiments.')
    results_aac = load_training_data(training_data_filename)
    
else:
    print(f'=== Running AAC experiment')
    training_data = train_aac(max_episodes=exp_max_episodes)
    results_aac = dict(aac=training_data)
    dump_training_data(results_aac, training_data_filename)
=== results file results/part1_exp_aac.dat exists, skipping experiments.
In [31]:
experiments_results_fig = plot_experiment_results(results)
plot_experiment_results(results_aac, fig=experiments_results_fig);

You should get better results with the AAC method, so this time the bar is higher (again, you should aim for a mean reward of 100+). Compare the graphs with those of the combined PG method (cpg) and see if they make sense.

In [32]:
best_aac_mean_reward = max(results_aac['aac']['mean_reward'])
print(f'Best AAC mean reward: {best_aac_mean_reward:.2f}')

test.assertGreater(best_aac_mean_reward, 50)
Best AAC mean reward: 160.01

Final model training and visualization¶

Now, using your best model and hyperparams, let's train the model for much longer and see how it performs. Just for fun, we'll also visualize an episode every now and then so that we can see how well the agent is playing.

TODO:

  • Run the following block to train.
  • Tweak model or hyperparams as necessary.
  • Aim for high mean reward, at least 150+. It's possible to get over 200.
  • When training is done and you're satisfied with the model's outputs, rename the checkpoint file by adding _final to the file name. This will cause the block to skip training and instead load your saved model when running the homework submission script. Note that your submission zip file will not include the checkpoint file. This is OK.
In [33]:
import IPython.display

CHECKPOINTS_FILE = f'checkpoints/{ENV_NAME}-ac.dat'
CHECKPOINTS_FILE_FINAL = f'checkpoints/{ENV_NAME}-ac_final.dat'
TARGET_REWARD = 125
MAX_EPISODES = 15_000

def post_batch_fn(batch_idx, p_net, batch, print_every=20, final=False):
    if not final and batch_idx % print_every != 0:
        return
    env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, p_net)
    html = show_monitor_video(env, width="500")
    IPython.display.clear_output(wait=True)
    print(f'Monitor@#{batch_idx}: n_steps={n_steps}, total_reward={reward:.3f}, final={final}')
    IPython.display.display_html(html)
    
    
if os.path.isfile(CHECKPOINTS_FILE_FINAL):
    print(f'=== {CHECKPOINTS_FILE_FINAL} exists, skipping training...')
    checkpoint_data = torch.load(CHECKPOINTS_FILE_FINAL)
    hp = hw4.answers.part1_aac_hyperparams()
    pv_net = hw4ac.AACPolicyNet.build_for_env(env, **hp)
    pv_net.load_state_dict(checkpoint_data['params'])
    print(f'=== Running best model...')
    env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, pv_net)
    print(f'=== Best model ran for {n_steps} steps. Total reward: {reward:.2f}')
    IPython.display.display_html(show_monitor_video(env))
    best_mean_reward = checkpoint_data["best_mean_reward"]
else:
    print(f'=== Starting training...')
    train_data = train_aac(TARGET_REWARD, max_episodes=MAX_EPISODES,
                           seed=None, checkpoints_file=CHECKPOINTS_FILE, post_batch_fn=post_batch_fn)
    print(f'=== Done, ', end='')
    best_mean_reward = train_data["best_mean_reward"][-1]
    print(f'num_episodes={train_data["episode_num"][-1]}, best_mean_reward={best_mean_reward:.1f}')
          
test.assertGreaterEqual(best_mean_reward, TARGET_REWARD)
Monitor@#1955: n_steps=300, total_reward=-34.716, final=True
#1955: step=03840825, loss_e= -0.12, m_reward(100)=  98.7 (best= 238.5): 100%|██████████| 15000/15000 [1:24:07<00:00,  2.97it/s]

=== STOPPING - Max episode reached
=== Done, num_episodes=15000, best_mean_reward=238.5

Questions¶

TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

In [38]:
from cs236781.answers import display_answer
import hw4.answers

Question 1¶

Explain qualitatively why subtracting a baseline in the policy-gradient helps reduce its variance. Specifically, give an example where it helps.

In [39]:
display_answer(hw4.answers.part1_q1)

1. The policy-gradient is a log-likelihood of the policy weighted by the reward. Without a baseline, we measure a trajectory's worth by its total reward. The weight of each action is the discounted sum of all rewards from the current state until the end of the episode, so it is affected by every action taken afterwards. For example, if two trajectories that start with the same action have very different returns, each individual return is far from the true value function, and the variance of the gradient is high. Subtracting a baseline centers the weights around zero, which reduces the variance. As a more concrete example, consider two trajectories: $T_1$ with a total reward of 10 and $T_2$ with a total reward of 100. The average reward is $55$. Because both rewards are greater than zero, without a baseline both trajectories have their probability increased, only by different amounts, and the gradient has a much higher variance than it would with the centered rewards ($-45$ and $45$). With the baseline, a below-average result is no longer treated as positive.
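The two-trajectory example above (rewards 10 and 100, average 55) can be checked numerically. The sketch below makes some assumptions that are not in the answer: each trajectory is a single action, the two actions are equally likely under a softmax policy over two logits, and we look at the gradient with respect to the first logit only. `estimator_stats` is a hypothetical helper that computes the exact mean and variance of the weighted score by enumeration.

```python
def estimator_stats(rewards, probs, dlogp, baseline):
    """Exact mean and variance of (r - b) * dlogp when the action is drawn from probs."""
    samples = [(r - baseline) * g for r, g in zip(rewards, dlogp)]
    mean = sum(p * s for p, s in zip(probs, samples))
    var = sum(p * (s - mean) ** 2 for p, s in zip(probs, samples))
    return mean, var

# Two equally likely actions under a softmax policy over two logits (assumed).
probs = [0.5, 0.5]
rewards = [10.0, 100.0]
# d/d(logit_0) of log pi(a): 1 - p_0 for a = 0, and -p_0 for a = 1.
dlogp = [1 - probs[0], -probs[0]]

mean_raw, var_raw = estimator_stats(rewards, probs, dlogp, baseline=0.0)
mean_bl, var_bl = estimator_stats(rewards, probs, dlogp, baseline=55.0)
```

Both estimators have the same expectation, so the baseline introduces no bias, while the variance drops from 756.25 to 0 in this symmetric toy case. In general the reduction is smaller, but the mechanism is the same.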

Question 2¶

In AAC, when using the estimated q-values as regression targets for our state-values, why do we get a valid approximation? Hint: how is $v_\pi(s)$ expressed in terms of $q_\pi(s,a)$?

In [40]:
display_answer(hw4.answers.part1_q2)

2. $v_{\pi}$ is the expected discounted reward starting from state $s_{t}$ under the policy $\pi$, and $q_{\pi}$ is the same except that the first action is fixed. We can therefore express $v_{\pi}$ as the average of the action-value function $q_{\pi}$ over all possible actions, weighted by the probability of choosing each action under the current policy $\pi$: $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$. The estimated q-values we compute are the sums of discounted rewards from the current state $s_{t}$, based on the actions we sampled. Because we cannot enumerate every possible trajectory, we rely on the sampled ones. As learning proceeds, these estimates get closer to $q_{\pi}$ and, averaged over the sampled actions, to $v_{\pi}$.
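One way to see why regressing on q-value targets yields the state value: for a fixed state, the constant prediction that minimizes the expected squared error against targets drawn under the policy is the policy-weighted mean of those targets. The sketch below checks this for illustrative numbers, which are assumptions and not taken from the experiments.

```python
policy = [0.25, 0.75]     # pi(a|s), illustrative
q_values = [8.0, 4.0]     # q_pi(s, a), illustrative

def expected_sq_error(c):
    """E_{a ~ pi}[(c - q_pi(s, a))^2] for a constant prediction c."""
    return sum(p * (c - q) ** 2 for p, q in zip(policy, q_values))

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
v = sum(p * q for p, q in zip(policy, q_values))

# Moving the prediction away from v in either direction increases the error.
loss_at_v = expected_sq_error(v)
loss_below = expected_sq_error(v - 0.5)
loss_above = expected_sq_error(v + 0.5)
```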

Question 3¶

  1. Analyze and explain the graphs you got in first experiment run.
  2. Compare the experiment graphs you got with the AAC method to the regular PG method (cpg).
In [41]:
display_answer(hw4.answers.part1_q3)

3.1

As we can see, learning succeeded according to all the plotted quantities:

First, in both epg and vpg the policy loss rises from large negative values and ends close to zero, which implies a successful learning process. Second, in cpg and bpg the policy loss shows no improvement: it stays around zero because we subtract the average value (the baseline). However, the baseline graph shows that the baseline itself goes up, so bpg and cpg also succeeded. From that we can conclude that the learning process was successful in every case, and that we were able to achieve these results using a baseline. Finally, in every experiment the mean reward starts from large negative values, which is caused by choosing actions at random, and rises to more than a hundred. We got much better rewards with cpg, so using a baseline improved our learning process, as explained in q1.

3.2 As we can see for the AAC model, the policy loss starts at large negative values and climbs to almost zero. Moreover, the policy loss of AAC gets there faster than that of the other models. Looking at the mean reward, cpg achieves a higher best value (247 compared to 160). The reason is that AAC needs more time to converge because it involves two learning processes, the policy and the state value. If we train AAC for more episodes, we expect the mean reward to improve and to exceed that of cpg.

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$

Part 2: Variational Autoencoder¶

In this part we will learn to generate new data using a special type of autoencoder model which allows us to sample from its latent space. We'll implement and train a VAE and use it to generate new images.

In [1]:
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile

import numpy as np
import torch
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2
In [2]:
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda

Obtaining the dataset¶

Let's begin by downloading a dataset of images that we want to learn to generate. We'll use the Labeled Faces in the Wild (LFW) dataset which contains many labeled faces of famous individuals.

We're going to train our generative model to generate a specific face, not just any face. Since the person with the most images in this dataset is former president George W. Bush, we'll set out to train a Bush Generator :)

However, if you feel adventurous and/or prefer to generate something else, feel free to edit the PART2_CUSTOM_DATA_URL variable in hw4/answers.py.

In [3]:
import cs236781.plot as plot
import cs236781.download
from hw4.answers import PART2_CUSTOM_DATA_URL as CUSTOM_DATA_URL

DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
    DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
    DATA_URL = CUSTOM_DATA_URL

_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /home/assaflovton/.pytorch-datasets/lfw-bush.zip exists, skipping download.
Extracting /home/assaflovton/.pytorch-datasets/lfw-bush.zip...
Extracted 531 to /home/assaflovton/.pytorch-datasets/lfw/George_W_Bush

Create a Dataset object that will load the extracted images:

In [4]:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

im_size = 64
tf = T.Compose([
    # Resize to constant spatial dimensions
    T.Resize((im_size, im_size)),
    # PIL.Image -> torch.Tensor
    T.ToTensor(),
    # Dynamic range [0,1] -> [-1, 1]
    T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])

ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)

OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.

In [5]:
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
In [6]:
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)

test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])

The Variational Autoencoder¶

An autoencoder is a model which learns a representation of data in an unsupervised fashion (i.e. without any labels). Recall its general form from the lecture:

An autoencoder maps an instance $\bb{x}$ to a latent-space representation $\bb{z}$. It has an encoder part, $\Phi_{\bb{\alpha}}(\bb{x})$ (a model with parameters $\bb{\alpha}$) and a decoder part, $\Psi_{\bb{\beta}}(\bb{z})$ (a model with parameters $\bb{\beta}$).

While autoencoders can learn useful representations, generally it's hard to use them as generative models because there's no distribution we can sample from in the latent space. In other words, we have no way to choose a point $\bb{z}$ in the latent space such that $\Psi(\bb{z})$ will end up on the data manifold in the instance space.

The variational autoencoder (VAE), first proposed by Kingma and Welling, addresses this issue by taking a probabilistic perspective. Briefly, a VAE model can be described as follows.

We define, in Bayesian terminology,

  • The prior distribution $p(\bb{Z})$ on points in the latent space.
  • The posterior distribution of points in the latent space given a specific instance: $p(\bb{Z}|\bb{X})$.
  • The likelihood distribution of a sample $\bb{X}$ given a latent-space representation: $p(\bb{X}|\bb{Z})$.
  • The evidence distribution $p(\bb{X})$ which is the distribution of the instance space due to the generative process.

To create our variational decoder we'll further specify:

  • A parametric likelihood distribution, $p _{\bb{\beta}}(\bb{X} | \bb{Z}=\bb{z}) = \mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$. The interpretation is that given a latent $\bb{z}$, we map it to a point normally distributed around the point calculated by our decoder neural network. Note that here $\sigma^2$ is a hyperparameter while $\vec{\beta}$ represents the network parameters.
  • A fixed latent-space prior distribution of $p(\bb{Z}) = \mathcal{N}(\bb{0},\bb{I})$.

This setting allows us to generate a new instance $\bb{x}$ by sampling $\bb{z}$ from the multivariate normal distribution, obtaining the instance-space mean $\Psi _{\bb{\beta}}(\bb{z})$ using our decoder network, and then sampling $\bb{x}$ from $\mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$.

Our variational encoder will approximate the posterior with a parametric distribution $q _{\bb{\alpha}}(\bb{Z} | \bb{x}) = \mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$. The interpretation is that our encoder model, $\Phi_{\vec{\alpha}}(\bb{x})$, calculates the mean and variance of the posterior distribution, and samples $\bb{z}$ based on them. An important nuance here is that our network can't contain any stochastic elements that depend on the model parameters, otherwise we won't be able to back-propagate to those parameters. So sampling $\bb{z}$ from $\mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$ is not an option. The solution is to use what's known as the reparametrization trick: sample from an isotropic Gaussian, i.e. $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ (which doesn't depend on trainable parameters), and calculate the latent representation as $\bb{z} = \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{u}\odot\bb{\sigma}_{\bb{\alpha}}(\bb{x})$.
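The reparametrization trick can be sketched without any deep-learning framework. In the snippet below, `reparameterize` is a hypothetical helper, and taking the log-variance as input is a choice made for this sketch. The mean and variance values are illustrative. The point is that all randomness comes from `u`, while `z` is a deterministic, differentiable function of the mean and standard deviation.

```python
import math
import random

def reparameterize(mu, log_sigma2, rng):
    """z = mu + u * sigma with u ~ N(0, I); randomness lives only in u."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(0.5 * ls2)
            for m, ls2 in zip(mu, log_sigma2)]

rng = random.Random(0)                        # fixed seed for repeatability
mu = [-0.5, 0.2]                              # illustrative posterior mean
log_sigma2 = [math.log(1.3), math.log(0.4)]   # illustrative log-variances

# Draw many samples and compare their statistics with mu and sigma^2.
samples = [reparameterize(mu, log_sigma2, rng) for _ in range(20000)]
n = len(samples)
mean = [sum(s[i] for s in samples) / n for i in range(2)]
var = [sum((s[i] - mean[i]) ** 2 for s in samples) / n for i in range(2)]
```

The empirical mean and variance should land close to `mu` and `exp(log_sigma2)`, which is the same sanity check one can run on a trained encoder.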

To train a VAE model, we maximize the evidence distribution, $p(\bb{X})$ (see question below). The VAE loss can therefore be stated as minimizing $\mathcal{L} = -\mathbb{E}_{\bb{x}} \log p(\bb{X})$. Although this expectation is intractable, we can obtain a lower bound for $\log p(\bb{X})$ (the evidence lower bound, "ELBO", shown in the lecture):

$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right) $$

where $ \mathcal{D} _{\mathrm{KL}}(q\left\|\right.p) = \mathbb{E}_{\bb{z}\sim q}\left[ \log \frac{q(\bb{Z})}{p(\bb{Z})} \right] $ is the Kullback-Leibler divergence, which can be interpreted as the information gained by using the posterior $q(\bb{Z|X})$ instead of the prior distribution $p(\bb{Z})$.

Using the ELBO, the VAE loss becomes, $$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ -\log p _{\bb{\beta}}(\bb{x} | \bb{z}) \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$

By remembering that the likelihood is a Gaussian distribution with a diagonal covariance and by applying the reparametrization trick, we can write the above as

$$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} } \left[ \frac{1}{2\sigma^2}\left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$

Model Implementation¶

Obviously our model will have two parts, an encoder and a decoder. Since we're working with images, we'll implement both as deep convolutional networks, where the decoder is a "mirror image" of the encoder implemented with adjoint (AKA transposed) convolutions. Between the encoder CNN and the decoder CNN we'll implement the sampling from the parametric posterior approximator $q_{\bb{\alpha}}(\bb{Z}|\bb{x})$ to make it a VAE model and not just a regular autoencoder (of course, this is not yet enough to create a VAE, since we also need a special loss function which we'll get to later).

First let's implement just the CNN part of the Encoder network (this is not the full $\Phi_{\vec{\alpha}}(\bb{x})$ yet). As usual, it should take an input image and map it to an activation volume of a specified depth. We'll consider this volume as the features we extract from the input image. Later we'll use these to create the latent space representation of the input.

TODO: Implement the EncoderCNN class in the hw4/autoencoder.py module. Implement any CNN architecture you like. If you need "architecture inspiration" you can see e.g. this or this paper.

In [7]:
import hw4.autoencoder as autoencoder

in_channels = 3
out_channels = 1024
encoder_cnn = autoencoder.EncoderCNN(in_channels, out_channels).to(device)
print(encoder_cnn)

h = encoder_cnn(x0)
print(h.shape)

test.assertEqual(h.dim(), 4)
test.assertSequenceEqual(h.shape[0:2], (1, out_channels))
EncoderCNN(
  (cnn): Sequential(
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU()
    (9): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (10): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU()
    (12): Conv2d(512, 1024, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (13): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
)
torch.Size([1, 1024, 2, 2])

Now let's implement the CNN part of the Decoder. Again this is not yet the full $\Psi _{\bb{\beta}}(\bb{z})$. It should take an activation volume produced by your EncoderCNN and output an image of the same dimensions as the Encoder's input. This can be a CNN which is like a "mirror image" of the Encoder. For example, replace convolutions with transposed convolutions, downsampling with up-sampling etc. Consult the documentation of ConvTranspose2d to figure out how to reverse your convolutional layers in terms of input and output dimensions. Note that the decoder doesn't have to be exactly the opposite of the encoder and you can experiment with using a different architecture.
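To reason about how a transposed convolution reverses a convolution's spatial dimensions, it can help to compute the sizes by hand. The sketch below uses the standard output-size formulas for dilation 1. The helper names are hypothetical, and the layer settings (kernel 4, stride 2, padding 1, five layers) are taken from the encoder printed above and assumed to be mirrored by the decoder.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# Encoder side: five stride-2 convolutions starting from a 64x64 image.
sizes_down = [64]
for _ in range(5):
    sizes_down.append(conv_out(sizes_down[-1], kernel=4, stride=2, padding=1))

# Decoder side: five transposed convolutions with the same settings.
sizes_up = [sizes_down[-1]]
for _ in range(5):
    sizes_up.append(conv_transpose_out(sizes_up[-1], kernel=4, stride=2, padding=1))
```

With these settings each convolution halves the spatial size and each transposed convolution doubles it, so the decoder recovers the 64x64 input resolution from the 2x2 activation volume.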

TODO: Implement the DecoderCNN class in the hw4/autoencoder.py module.

In [8]:
decoder_cnn = autoencoder.DecoderCNN(in_channels=out_channels, out_channels=in_channels).to(device)
print(decoder_cnn)
x0r = decoder_cnn(h)
print(x0r.shape)

test.assertEqual(x0.shape, x0r.shape)

# Should look like colored noise
T.functional.to_pil_image(x0r[0].cpu().detach())
DecoderCNN(
  (cnn): Sequential(
    (0): ConvTranspose2d(1024, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU()
    (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU()
    (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
  )
)
torch.Size([1, 3, 64, 64])
Out[8]:

Let's now implement the full VAE Encoder, $\Phi_{\vec{\alpha}}(\vec{x})$. It will work as follows:

  1. Produce a feature vector $\vec{h}$ from the input image $\vec{x}$.
  2. Use two affine transforms to convert the features into the mean and log-variance of the posterior, i.e. $$ \begin{align}
     \bb{\mu} _{\bb{\alpha}}(\bb{x}) &= \vec{h}\mattr{W}_{\mathrm{h\mu}} + \vec{b}_{\mathrm{h\mu}} \\
     \log\left(\bb{\sigma}^2_{\bb{\alpha}}(\bb{x})\right) &= \vec{h}\mattr{W}_{\mathrm{h\sigma^2}} + \vec{b}_{\mathrm{h\sigma^2}}
     \end{align} $$
  3. Use the reparametrization trick to create the latent representation $\vec{z}$.

Notice that we model the log of the variance, not the actual variance. The above formulation is proposed in appendix C of the VAE paper.

TODO: Implement the encode() method in the VAE class within the hw4/autoencoder.py module. You'll also need to define your parameters in __init__().

In [9]:
z_dim = 2
vae = autoencoder.VAE(encoder_cnn, decoder_cnn, x0[0].size(), z_dim).to(device)
print(vae)

z, mu, log_sigma2 = vae.encode(x0)

test.assertSequenceEqual(z.shape, (1, z_dim))
test.assertTrue(z.shape == mu.shape == log_sigma2.shape)

print(f'mu(x0)={list(*mu.detach().cpu().numpy())}, sigma2(x0)={list(*torch.exp(log_sigma2).detach().cpu().numpy())}')
VAE(
  (features_encoder): EncoderCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (10): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU()
      (12): Conv2d(512, 1024, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (13): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (features_decoder): DecoderCNN(
    (cnn): Sequential(
      (0): ConvTranspose2d(1024, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU()
      (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
  )
  (modu): Linear(in_features=4096, out_features=2, bias=True)
  (sig): Linear(in_features=4096, out_features=2, bias=True)
  (z_layer): Linear(in_features=2, out_features=4096, bias=True)
)
mu(x0)=[-0.47476906, -0.21610819], sigma2(x0)=[1.3371278, 1.5939903]

Let's sample some 2d latent representations for an input image x0 and visualize them.

In [10]:
# Sample from q(Z|x)
N = 500
Z = torch.zeros(N, z_dim)
_, ax = plt.subplots()
with torch.no_grad():
    for i in range(N):
        Z[i], _, _ = vae.encode(x0)
        ax.scatter(*Z[i].cpu().numpy())

# Should be close to the mu/sigma in the previous block above
print('sampled mu', torch.mean(Z, dim=0))
print('sampled sigma2', torch.var(Z, dim=0))
sampled mu tensor([-0.4394, -0.1960])
sampled sigma2 tensor([1.3443, 1.7524])

Let's now implement the full VAE Decoder, $\Psi _{\bb{\beta}}(\bb{z})$. It will work as follows:

  1. Produce a feature vector $\tilde{\vec{h}}$ from the latent vector $\vec{z}$ using an affine transform.
  2. Reconstruct an image $\tilde{\vec{x}}$ from $\tilde{\vec{h}}$ using the decoder CNN.

TODO: Implement the decode() method in the VAE class within the hw4/autoencoder.py module. You'll also need to define your parameters in __init__(). You may need to also re-run the block above after you implement this.

In [11]:
x0r = vae.decode(z)

test.assertSequenceEqual(x0r.shape, x0.shape)

Our model's forward() function will simply return decode(encode(x)) as well as the calculated mean and log-variance of the posterior.

In [12]:
x0r, mu, log_sigma2 = vae(x0)

test.assertSequenceEqual(x0r.shape, x0.shape)
test.assertSequenceEqual(mu.shape, (1, z_dim))
test.assertSequenceEqual(log_sigma2.shape, (1, z_dim))
T.functional.to_pil_image(x0r[0].detach().cpu())
Out[12]:

Loss Implementation¶

In practice, since we're using SGD, we'll drop the expectation over $\bb{X}$ and instead sample an instance from the training set and compute a point-wise loss. Similarly, we'll drop the expectation over $\bb{Z}$ by sampling from $q_{\vec{\alpha}}(\bb{Z}|\bb{x})$. Additionally, because the KL divergence is between two Gaussian distributions, there is a closed-form expression for it. These points bring us to the following point-wise loss:

$$ \ell(\vec{\alpha},\vec{\beta};\bb{x}) = \frac{1}{\sigma^2 d_x} \left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 + \mathrm{tr}\,\bb{\Sigma} _{\bb{\alpha}}(\bb{x}) + \|\bb{\mu} _{\bb{\alpha}}(\bb{x})\|^2 _2 - d_z - \log\det \bb{\Sigma} _{\bb{\alpha}}(\bb{x}), $$

where $d_z$ is the dimension of the latent space, $d_x$ is the dimension of the input and $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$. This pointwise loss is the quantity that we'll compute and minimize with gradient descent. The first term corresponds to the data-reconstruction loss, while the remaining terms together correspond to the KL-divergence loss. Note that the scaling by $d_x$ is not derived from the original loss formula and was added directly to the pointwise loss just to normalize the data term.
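As a rough sketch, the pointwise loss above could be written in PyTorch as shown below. It assumes a diagonal covariance $\bb{\Sigma}_{\bb{\alpha}}(\bb{x}) = \mathrm{diag}(\exp(\texttt{z\_log\_sigma2}))$ and averages the pointwise losses over the batch. The function name `pointwise_vae_loss_sketch` and the choice to return the data and KL terms separately are assumptions; this is not claimed to reproduce the reference value used in the test below.

```python
import torch


def pointwise_vae_loss_sketch(x, xr, z_mu, z_log_sigma2, x_sigma2):
    """Sketch of the pointwise VAE loss, averaged over the batch (diagonal Sigma assumed)."""
    # Data term: squared L2 error scaled by 1 / (sigma^2 * d_x)
    d_x = x[0].numel()
    sq_err = (x - xr).flatten(start_dim=1).pow(2).sum(dim=1)
    data_loss = (sq_err / (x_sigma2 * d_x)).mean()

    # KL term: tr(Sigma) + ||mu||^2 - d_z - log det(Sigma)
    d_z = z_mu.shape[1]
    z_sigma2 = z_log_sigma2.exp()
    kl = z_sigma2.sum(dim=1) + z_mu.pow(2).sum(dim=1) - d_z - z_log_sigma2.sum(dim=1)
    kldiv_loss = kl.mean()

    return data_loss + kldiv_loss, data_loss, kldiv_loss


# Sanity check: a perfect reconstruction with mu = 0 and Sigma = I gives zero loss
x = torch.zeros(4, 3, 8, 8)
loss, data_loss, kldiv_loss = pointwise_vae_loss_sketch(
    x, x, torch.zeros(4, 2), torch.zeros(4, 2), x_sigma2=0.9
)
print(loss.item(), data_loss.item(), kldiv_loss.item())
```

With a diagonal covariance, the trace is the sum of the variances and the log-determinant is the sum of the log-variances, which is why the encoder's log-variance output can be used directly.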

TODO: Implement the vae_loss() function in the hw4/autoencoder.py module.

In [13]:
from hw4.autoencoder import vae_loss
torch.manual_seed(42)

def test_vae_loss():
    # Test data
    N, C, H, W = 10, 3, 64, 64 
    z_dim = 32
    x  = torch.randn(N, C, H, W)*2 - 1
    xr = torch.randn(N, C, H, W)*2 - 1
    z_mu = torch.randn(N, z_dim)
    z_log_sigma2 = torch.randn(N, z_dim)
    x_sigma2 = 0.9
    
    loss, _, _ = vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
    
    test.assertAlmostEqual(loss.item(), 58.3234367, delta=1e-3)
    return loss

test_vae_loss()
Out[13]:
tensor(58.3234)

Sampling¶

The main advantage of a VAE is that it can be used as a generative model by sampling the latent space, since we optimize for an isotropic Gaussian prior $p(\bb{Z})$ in the loss function. Let's now implement this so that we can visualize how our model is doing when we train.
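A minimal sketch of this idea: draw latent vectors from the isotropic Gaussian prior and pass them through the decoder, without tracking gradients. The helper name `sample_from_prior` and the toy decoder are hypothetical; in the assignment the equivalent logic would live in the `sample()` method of the VAE class.

```python
import torch
import torch.nn as nn


def sample_from_prior(decode_fn, n, z_dim, device="cpu"):
    """Decode n latent vectors drawn from N(0, I); no gradients are tracked."""
    with torch.no_grad():
        z = torch.randn(n, z_dim, device=device)
        return decode_fn(z).cpu()


# Toy decoder (illustration only): a linear map reshaped into 3x4x4 "images"
toy_decoder = nn.Sequential(nn.Linear(2, 3 * 4 * 4), nn.Unflatten(1, (3, 4, 4)))
samples = sample_from_prior(toy_decoder, n=5, z_dim=2)
print(samples.shape, samples.requires_grad)
```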

TODO: Implement the sample() method in the VAE class within the hw4/autoencoder.py module.

In [14]:
samples = vae.sample(5)
_ = plot.tensors_as_images(samples)

Training¶

Time to train!

TODO:

  1. Implement the VAETrainer class in the hw4/training.py module. Make sure to implement the checkpoints feature of the Trainer class if you haven't done so already in Part 1.
  2. Tweak the hyperparameters in the part2_vae_hyperparams() function within the hw4/answers.py module.
In [15]:
import torch.optim as optim
from torch.utils.data import random_split
from torch.utils.data import DataLoader
from torch.nn import DataParallel
from hw4.training import VAETrainer
from hw4.answers import part2_vae_hyperparams

torch.manual_seed(42)

# Hyperparams
hp = part2_vae_hyperparams()
batch_size = hp['batch_size']
h_dim = hp['h_dim']
z_dim = hp['z_dim']
x_sigma2 = hp['x_sigma2']
learn_rate = hp['learn_rate']
betas = hp['betas']

# Data
split_lengths = [int(len(ds_gwb)*0.9), int(len(ds_gwb)*0.1)]
ds_train, ds_test = random_split(ds_gwb, split_lengths)
dl_train = DataLoader(ds_train, batch_size, shuffle=True)
dl_test  = DataLoader(ds_test,  batch_size, shuffle=True)
im_size = ds_train[0][0].shape

# Model
encoder = autoencoder.EncoderCNN(in_channels=im_size[0], out_channels=h_dim)
decoder = autoencoder.DecoderCNN(in_channels=h_dim, out_channels=im_size[0])
vae = autoencoder.VAE(encoder, decoder, im_size, z_dim)
vae_dp = DataParallel(vae).to(device)

# Optimizer
optimizer = optim.Adam(vae.parameters(), lr=learn_rate, betas=betas)

# Loss
def loss_fn(x, xr, z_mu, z_log_sigma2):
    return autoencoder.vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)

# Trainer
trainer = VAETrainer(vae_dp, loss_fn, optimizer, device)
checkpoint_file = 'checkpoints/vae'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
    os.remove(f'{checkpoint_file}.pt')

# Show model and hypers
print(vae)
print(hp)
VAE(
  (features_encoder): EncoderCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (10): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU()
      (12): Conv2d(512, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (13): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (features_decoder): DecoderCNN(
    (cnn): Sequential(
      (0): ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU()
      (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
  )
  (modu): Linear(in_features=512, out_features=64, bias=True)
  (sig): Linear(in_features=512, out_features=64, bias=True)
  (z_layer): Linear(in_features=64, out_features=512, bias=True)
)
{'batch_size': 16, 'h_dim': 128, 'z_dim': 64, 'x_sigma2': 0.005, 'learn_rate': 0.0001, 'betas': (0.8, 0.99)}

TODO:

  1. Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
  2. When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.

The images you get should be colorful, with different backgrounds and poses.

In [16]:
import IPython.display

def post_epoch_fn(epoch, train_result, test_result, verbose):
    # Plot some samples if this is a verbose epoch
    if verbose:
        samples = vae.sample(n=5)
        fig, _ = plot.tensors_as_images(samples, figsize=(6,2))
        IPython.display.display(fig)
        plt.close(fig)

if os.path.isfile(f'{checkpoint_file_final}.pt'):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    checkpoint_file = checkpoint_file_final
else:
    res = trainer.fit(dl_train, dl_test,
                      num_epochs=200, early_stopping=25, print_every=10,
                      checkpoints=checkpoint_file,
                      post_epoch_fn=post_epoch_fn)

saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
vae_dp.load_state_dict(saved_state['model_state'])
print('*** Images Generated from best model:')
fig, _ = plot.tensors_as_images(vae_dp.module.sample(n=15), nrows=3, figsize=(6,6))
--- EPOCH 1/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt
*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt
--- EPOCH 11/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt
*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt
--- EPOCH 21/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt
--- EPOCH 31/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt
--- EPOCH 41/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt

*** Saved checkpoint checkpoints/vae.pt
--- EPOCH 51/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt
--- EPOCH 61/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt
--- EPOCH 71/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
--- EPOCH 81/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt
--- EPOCH 91/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
--- EPOCH 101/200 ---
train_batch:   0%|          | 0/30 [00:00<?, ?it/s]
test_batch:   0%|          | 0/4 [00:00<?, ?it/s]
*** Saved checkpoint checkpoints/vae.pt
*** Images Generated from best model:

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

In [22]:
from cs236781.answers import display_answer
import hw4.answers as answers

Question 1¶

What does the $\sigma^2$ hyperparameter (x_sigma2 in the code) do? Explain the effect of low and high values.

In [23]:
display_answer(answers.part2_q1)

1. The x_sigma2 hyperparameter controls how strongly the model is pushed to reproduce the dataset, and therefore how much freedom it has to generate new images rather than copy existing ones. In the loss function the reconstruction term is divided by $\sigma^2$, so this value sets the relative weight between the reconstruction loss and the KL-divergence loss. With small values the reconstruction term dominates: to minimize the loss, the model produces near “copies” of images that can already be found in the dataset, which reduces its ability to generate “new” images. With large values, the weight assigned to the difference between the input image and the generated one shrinks, so the model is freer to create its own images rather than just copying them. With very large values the generated images become unrecognizable, since almost nothing “pushes” them to be similar to the images of President Bush.
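To make the weighting concrete, the short sketch below computes the factor $\frac{1}{\sigma^2 d_x}$ that multiplies the squared reconstruction error in the pointwise loss, for a few values of `x_sigma2`. The helper name `data_term_weight` and the specific values other than 0.005 are chosen only for illustration.

```python
def data_term_weight(x_sigma2, d_x):
    """Factor multiplying the squared reconstruction error in the pointwise loss."""
    return 1.0 / (x_sigma2 * d_x)


d_x = 3 * 64 * 64  # number of values in a 3x64x64 image
for x_sigma2 in (0.0005, 0.005, 0.5):
    print(f"x_sigma2={x_sigma2}: data-term weight = {data_term_weight(x_sigma2, d_x):.6f}")

# Ratio between the weights for the smallest and largest value above
ratio = data_term_weight(0.0005, d_x) / data_term_weight(0.5, d_x)
print(f"ratio = {ratio:.1f}")
```

Since the KL terms are not scaled by $\sigma^2$, shrinking `x_sigma2` by some factor makes the reconstruction error count that many times more relative to the KL divergence.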

Question 2¶

  1. Explain the purpose of both parts of the VAE loss term - reconstruction loss and KL divergence loss.
  2. How is the latent-space distribution affected by the KL loss term?
  3. What's the benefit of this effect?
In [24]:
display_answer(answers.part2_q2)

2.1

The reconstruction loss measures the difference between the input image and the image reconstructed from its latent code, on average over the data. The more similar the reconstruction gets to the input, the smaller the reconstruction loss gets. The KL-divergence loss measures how far the latent distribution produced by the encoder, q(Z|x), is from the prior p(Z), and penalizes the encoder when the two differ.

2.2

The effect of the KL-loss term on the latent space is that it pulls the latent distribution the encoder produces for each image toward the isotropic Gaussian prior N(0, I). It achieves this by penalizing a mean μ that is far from zero and a variance σ² that is far from one.

2.3

The benefit of this effect is that it allows us to create an image generator that does not copy images from the dataset. Because the latent codes of the training images are kept close to the prior, we can sample a latent vector from N(0, I) and decode it into a plausible new image. Adding the KL loss to the reconstruction loss therefore means that the model will not simply memorize images; it minimizes both losses together, which keeps the latent distribution close to the one we sample from.

Question 3¶

In the formulation of the VAE loss, why do we start by maximizing the evidence distribution, $p(\bb{X})$?

In [25]:
display_answer(answers.part2_q3)

3. The evidence distribution is p(X)=∫p(X|z)p(z)dz. Maximizing the evidence means choosing model parameters under which the training images are likely, so that latent vectors z drawn from the prior decode to valid images in the instance space. Starting from this formulation of the optimization problem lets us derive a tractable objective that defines both our encoding and our decoding.

Question 4¶

In the VAE encoder, why do we model the log of the latent-space variance corresponding to an input, $\bb{\sigma}^2_{\bb{\alpha}}$, instead of directly modelling this variance?

In [26]:
display_answer(answers.part2_q4)

The main advantage of modeling the log of the variance instead of the variance itself is that a variance must be positive, while its logarithm can take any real value, so the encoder can output it from a linear layer without any constraint. In addition, the logarithmic scale compresses the range of values and makes small differences less noticeable, which improves numerical stability and helps our model learn better.
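A small sketch of what modeling the log-variance buys in practice: whatever real number the encoder outputs, exponentiating it always yields a valid (positive) variance, and the standard deviation follows from halving the log-variance. The tensor values below are arbitrary examples.

```python
import torch

# Unconstrained encoder outputs, interpreted as log-variances (arbitrary example values)
z_log_sigma2 = torch.tensor([-5.0, 0.0, 5.0])

# Variance and standard deviation recovered from the log-variance
z_sigma2 = torch.exp(z_log_sigma2)
z_sigma = torch.exp(0.5 * z_log_sigma2)

print(z_sigma2)  # always strictly positive
print(z_sigma)
```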

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$

Part 3: Generative Adversarial Networks¶

In this part we will implement and train a generative adversarial network and apply it to the task of image generation.

In [1]:
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile

import numpy as np
import torch
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda

Obtaining the dataset¶

We'll use the same data as in Part 2.

But again, you can use a custom dataset, by editing the PART3_CUSTOM_DATA_URL variable in hw4/answers.py.

In [2]:
import cs236781.plot as plot
import cs236781.download
from hw4.answers import PART3_CUSTOM_DATA_URL as CUSTOM_DATA_URL

DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
    DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
    DATA_URL = CUSTOM_DATA_URL

_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /home/assaflovton/.pytorch-datasets/lfw-bush.zip exists, skipping download.
Extracting /home/assaflovton/.pytorch-datasets/lfw-bush.zip...
Extracted 531 to /home/assaflovton/.pytorch-datasets/lfw/George_W_Bush

Create a Dataset object that will load the extracted images:

In [3]:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

im_size = 64
tf = T.Compose([
    # Resize to constant spatial dimensions
    T.Resize((im_size, im_size)),
    # PIL.Image -> torch.Tensor
    T.ToTensor(),
    # Dynamic range [0,1] -> [-1, 1]
    T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])

ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)

OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.

In [4]:
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
In [5]:
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)

test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])

Generative Adversarial Nets (GANs)¶

GANs, first proposed in a 2014 paper by Ian Goodfellow et al., are today arguably the most popular type of generative model. GANs are currently producing state-of-the-art results in generative tasks over many different domains.

In a GAN model, two different neural networks compete against each other: A generator and a discriminator.

  • The Generator, which we'll denote as $\Psi _{\bb{\gamma}} : \mathcal{U} \rightarrow \mathcal{X}$, maps a latent-space variable $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ to an instance-space variable $\bb{x}$ (e.g. an image). Thus a parametric evidence distribution $p_{\bb{\gamma}}(\bb{X})$ is generated, which we typically would like to be as close as possible to the real evidence distribution, $p(\bb{X})$.

  • The Discriminator, $\Delta _{\bb{\delta}} : \mathcal{X} \rightarrow [0,1]$, is a network which, given an instance-space variable $\bb{x}$, returns the probability that $\bb{x}$ is real, i.e. that $\bb{x}$ was sampled from $p(\bb{X})$ and not $p_{\bb{\gamma}}(\bb{X})$.

Training GANs¶

The generator is trained to generate "fake" instances which will maximally fool the discriminator into returning that they're real. Mathematically, the generator's parameters $\bb{\gamma}$ should be chosen so as to maximize the expression $$ \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

The discriminator is trained to classify between real images, coming from the training set, and fake images generated by the generator. Mathematically, the discriminator's parameters $\bb{\delta}$ should be chosen so as to maximize the expression $$ \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

These two competing objectives can thus be expressed as the following min-max optimization: $$ \min _{\bb{\gamma}} \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

A key insight into GANs is that we can interpret the above maximum as the loss with respect to $\bb{\gamma}$:

$$ L({\bb{\gamma}}) = \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

This means that the generator's loss function trains together with the generator itself in an adversarial manner. In contrast, when training our VAE we used a fixed L2 norm as a data loss term.
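To make the quantity inside the min-max concrete, here is a small sketch that estimates it from a handful of discriminator outputs. The function name `gan_value` and the example probabilities are made up for illustration. A discriminator that confidently separates real from fake drives the value toward its upper bound of 0, while one that outputs 0.5 for every input gives $-\log 4$, the value reached when real and generated samples are indistinguishable.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator probabilities for real samples
    d_fake: discriminator probabilities for generated samples
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Discriminator that is confident and correct (hypothetical outputs)
confident = gan_value([0.99, 0.98], [0.01, 0.02])
# Discriminator that cannot tell real from fake
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

print(confident, undecided, -math.log(4))
```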

Model Implementation¶

We'll now implement a Deep Convolutional GAN (DCGAN) model. See the DCGAN paper for architecture ideas and tips for training.

TODO: Implement the Discriminator class in the hw4/gan.py module. If you wish you can reuse the EncoderCNN class from the VAE model as the first part of the Discriminator.

In [6]:
import hw4.gan as gan

dsc = gan.Discriminator(in_size=x0[0].shape).to(device)
print(dsc)

d0 = dsc(x0)
print(d0.shape)

test.assertSequenceEqual(d0.shape, (1,1))
Discriminator(
  (cnn): Sequential(
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): ReLU()
    (4): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (5): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (8): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU()
    (10): Conv2d(512, 1, kernel_size=(4, 4), stride=(2, 2))
  )
)
torch.Size([1, 1])

TODO: Implement the Generator class in the hw4/gan.py module. If you wish you can reuse the DecoderCNN class from the VAE model as the last part of the Generator.

In [7]:
z_dim = 128
gen = gan.Generator(z_dim, 4).to(device)
print(gen)

z = torch.randn(1, z_dim).to(device)
xr = gen(z)
print(xr.shape)

test.assertSequenceEqual(x0.shape, xr.shape)
Generator(
  (cnn): Sequential(
    (0): ConvTranspose2d(128, 512, kernel_size=(4, 4), stride=(2, 2))
    (1): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): ReLU()
    (4): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (5): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (8): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU()
    (10): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (11): Tanh()
  )
)
torch.Size([1, 3, 64, 64])

Loss Implementation¶

Let's begin with the discriminator's loss function. Based on the above we can flip the sign and say we want to update the Discriminator's parameters $\bb{\delta}$ so that they minimize the expression $$ - \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, - \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

We're using the Discriminator twice in this expression; once to classify data from the real data distribution and once again to classify generated data. Therefore our loss should be computed based on these two terms. Notice that since the discriminator returns a probability, we can formulate the above as two cross-entropy losses.

GANs are notoriously difficult to train. One common trick for improving GAN stability during training is to make the classification labels noisy for the discriminator. This can be seen as a form of regularization, to help prevent the discriminator from overfitting.

We'll incorporate this idea into our loss function. Instead of labels being equal to 0 or 1, we'll make them "fuzzy", i.e. random numbers in the ranges $[0\pm\epsilon]$ and $[1\pm\epsilon]$.
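One possible way to sketch a discriminator loss with fuzzy labels is shown below. It rests on several assumptions that the notebook does not spell out: the discriminator outputs raw scores (logits) rather than probabilities, `label_noise` is read as the half-width $\epsilon$ of a uniform interval around each label, and the two cross-entropy terms are simply summed. The name `discriminator_loss_sketch` is hypothetical.

```python
import torch
import torch.nn.functional as F


def discriminator_loss_sketch(y_data, y_generated, data_label=0, label_noise=0.0):
    """Sum of two cross-entropy terms with uniformly perturbed ("fuzzy") targets."""
    assert data_label in (0, 1)
    eps = label_noise  # assumption: half-width of the noise interval

    # Targets drawn uniformly from [label - eps, label + eps]
    real_targets = data_label + (torch.rand_like(y_data) * 2 - 1) * eps
    fake_targets = (1 - data_label) + (torch.rand_like(y_generated) * 2 - 1) * eps

    # Assumption: y_data and y_generated are logits, so the sigmoid is applied internally
    loss_data = F.binary_cross_entropy_with_logits(y_data, real_targets)
    loss_generated = F.binary_cross_entropy_with_logits(y_generated, fake_targets)
    return loss_data + loss_generated


# Without noise, a discriminator that scores real data high and fake data low has a small loss
good = discriminator_loss_sketch(torch.full((4,), 10.0), torch.full((4,), -10.0), data_label=1)
bad = discriminator_loss_sketch(torch.full((4,), -10.0), torch.full((4,), 10.0), data_label=1)
print(good.item(), bad.item())
```

The `data_label` argument decides which class (0 or 1) stands for real data; generated data gets the opposite label.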

TODO: Implement the discriminator_loss_fn() function in the hw4/gan.py module.

In [8]:
from hw4.gan import discriminator_loss_fn
torch.manual_seed(42)

y_data = torch.rand(10) * 10
y_generated = torch.rand(10) * 10

loss = discriminator_loss_fn(y_data, y_generated, data_label=1, label_noise=0.3)
print(loss)

test.assertAlmostEqual(loss.item(), 6.4808731, delta=1e-5)
tensor(6.4809)

Similarly, the generator's parameters $\bb{\gamma}$ should minimize the expression $$ -\mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )) $$

which can also be seen as a cross-entropy term. This corresponds to "fooling" the discriminator. Notice that the gradient of the loss w.r.t. $\bb{\gamma}$ using this expression also depends on $\bb{\delta}$.
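A sketch of this cross-entropy view of the generator loss: the generator is scored as if its samples carried the label of real data. As before, this assumes the discriminator returns logits, and the name `generator_loss_sketch` is hypothetical.

```python
import torch
import torch.nn.functional as F


def generator_loss_sketch(y_generated, data_label=0):
    """Cross-entropy between discriminator scores of fake samples and the *real* label."""
    assert data_label in (0, 1)
    targets = torch.full_like(y_generated, float(data_label))
    # Assumption: y_generated holds logits, so the sigmoid is applied internally
    return F.binary_cross_entropy_with_logits(y_generated, targets)


# The loss is small when the discriminator is fooled (fake samples scored as real) ...
fooled = generator_loss_sketch(torch.full((4,), 10.0), data_label=1)
# ... and large when the discriminator confidently rejects the fake samples
caught = generator_loss_sketch(torch.full((4,), -10.0), data_label=1)
print(fooled.item(), caught.item())
```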

TODO: Implement the generator_loss_fn() function in the hw4/gan.py module.

In [9]:
from hw4.gan import generator_loss_fn
torch.manual_seed(42)

y_generated = torch.rand(20) * 10

loss = generator_loss_fn(y_generated, data_label=1)
print(loss)

test.assertAlmostEqual(loss.item(), 0.0222969, delta=1e-3)
tensor(0.0223)

Sampling¶

Sampling from a GAN is straightforward, since it learns to generate data from an isotropic Gaussian latent space distribution.

There is an important nuance however. Sampling is required during the process of training the GAN, since we generate fake images to show the discriminator. As you'll see in the next section, in some cases we'll need our samples to have gradients (i.e., to be part of the Generator's computation graph).
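One way to sketch a sampler that can switch gradient tracking on and off is with `torch.set_grad_enabled`. The `ToyGenerator` class below is a hypothetical stand-in with a single linear layer, not the DCGAN generator from `hw4/gan.py`.

```python
import torch
import torch.nn as nn


class ToyGenerator(nn.Module):
    """Hypothetical stand-in for a generator, used only to illustrate sample()."""

    def __init__(self, z_dim):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Linear(z_dim, 6)

    def forward(self, z):
        return self.net(z)

    def sample(self, n, with_grad=False):
        device = next(self.parameters()).device
        # Track gradients only when the samples must stay in the computation graph
        with torch.set_grad_enabled(with_grad):
            z = torch.randn(n, self.z_dim, device=device)
            return self(z)


toy_gen = ToyGenerator(z_dim=4)
detached_samples = toy_gen.sample(3, with_grad=False)
graph_samples = toy_gen.sample(3, with_grad=True)
print(detached_samples.grad_fn, graph_samples.grad_fn is not None)
```

Samples without gradients are enough for visualization and for the discriminator update, while the generator update needs samples that backpropagate into the generator's parameters.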

TODO: Implement the sample() method in the Generator class within the hw4/gan.py module.

In [10]:
samples = gen.sample(5, with_grad=False)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNone(samples.grad_fn)
_ = plot.tensors_as_images(samples.cpu())

samples = gen.sample(5, with_grad=True)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNotNone(samples.grad_fn)

Training¶

Training GANs is a bit different since we need to train two models simultaneously, each with its own separate loss function and optimizer. We'll implement the training logic as a function that handles one batch of data and updates both the discriminator and the generator based on it.
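The per-batch logic described above could look roughly like the sketch below: one discriminator update on real and detached fake data, followed by one generator update through the discriminator. The toy linear models, the loss helpers, and the extra `z_dim` argument are assumptions made to keep the example self-contained; the assignment's `train_batch` has its own signature and samples from the generator's `sample()` method.

```python
import torch
import torch.nn as nn


def train_batch_sketch(dsc, gen, dsc_loss_fn, gen_loss_fn, dsc_opt, gen_opt, x_data, z_dim):
    n = x_data.shape[0]

    # Discriminator step: fake samples are detached so the generator is not updated here
    dsc_opt.zero_grad()
    x_fake = gen(torch.randn(n, z_dim, device=x_data.device)).detach()
    dsc_loss = dsc_loss_fn(dsc(x_data), dsc(x_fake))
    dsc_loss.backward()
    dsc_opt.step()

    # Generator step: fake samples keep their graph so gradients reach the generator
    gen_opt.zero_grad()
    x_fake = gen(torch.randn(n, z_dim, device=x_data.device))
    gen_loss = gen_loss_fn(dsc(x_fake))
    gen_loss.backward()
    gen_opt.step()

    return dsc_loss.item(), gen_loss.item()


# Toy setup (illustration only): linear "generator" and "discriminator" on 8-dim data
torch.manual_seed(0)
z_dim = 4
gen = nn.Linear(z_dim, 8)
dsc = nn.Linear(8, 1)
bce = nn.BCEWithLogitsLoss()


def toy_dsc_loss(y_real, y_fake):
    return bce(y_real, torch.ones_like(y_real)) + bce(y_fake, torch.zeros_like(y_fake))


def toy_gen_loss(y_fake):
    return bce(y_fake, torch.ones_like(y_fake))


dsc_opt = torch.optim.Adam(dsc.parameters(), lr=2e-4)
gen_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)
losses = train_batch_sketch(dsc, gen, toy_dsc_loss, toy_gen_loss,
                            dsc_opt, gen_opt, torch.randn(16, 8), z_dim)
print(losses)
```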

As mentioned above, GANs are considered hard to train. To get some ideas and tips you can see this paper, this list of "GAN hacks" or just do it the hard way :)

TODO:

  1. Implement the train_batch function in the hw4/gan.py module.
  2. Tweak the hyperparameters in the part3_gan_hyperparams() function within the hw4/answers.py module.
In [11]:
import torch.optim as optim
from torch.utils.data import DataLoader
from hw4.answers import part3_gan_hyperparams

torch.manual_seed(42)

# Hyperparams
hp = part3_gan_hyperparams()
batch_size = hp['batch_size']
z_dim = hp['z_dim']

# Data
dl_train = DataLoader(ds_gwb, batch_size, shuffle=True)
im_size = ds_gwb[0][0].shape

# Model
dsc = gan.Discriminator(im_size).to(device)
gen = gan.Generator(z_dim, featuremap_size=4).to(device)

# Optimizer
def create_optimizer(model_params, opt_params):
    opt_params = opt_params.copy()
    optimizer_type = opt_params['type']
    opt_params.pop('type')
    return optim.__dict__[optimizer_type](model_params, **opt_params)
dsc_optimizer = create_optimizer(dsc.parameters(), hp['discriminator_optimizer'])
gen_optimizer = create_optimizer(gen.parameters(), hp['generator_optimizer'])

# Loss
def dsc_loss_fn(y_data, y_generated):
    return gan.discriminator_loss_fn(y_data, y_generated, hp['data_label'], hp['label_noise'])

def gen_loss_fn(y_generated):
    return gan.generator_loss_fn(y_generated, hp['data_label'])

# Training
checkpoint_file = 'checkpoints/gan'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
    os.remove(f'{checkpoint_file}.pt')

# Show hypers
print(hp)
{'batch_size': 4, 'z_dim': 128, 'data_label': 0, 'label_noise': 0.3, 'discriminator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.6, 0.998)}, 'generator_optimizer': {'type': 'Adam', 'lr': 0.0002, 'betas': (0.6, 0.998)}}

TODO:

  1. Implement the save_checkpoint function in the hw4.gan module. You can decide on your own criterion regarding whether to save a checkpoint at the end of each epoch.
  2. Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
  3. When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.
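As one example of a checkpoint criterion, the sketch below saves the generator whenever the latest average generator loss is the lowest seen so far. This criterion is only an assumption for illustration; loss values in adversarial training are not a reliable measure of image quality, so other rules may work better. Saving the whole generator object (rather than a state dict) mirrors how the training block below loads the final checkpoint directly into `gen`. The name `save_checkpoint_sketch` is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint_sketch(gen_model, dsc_losses, gen_losses, checkpoint_file):
    """Save the generator if the latest generator loss is the best so far.

    Returns True if a checkpoint was written, False otherwise.
    """
    if len(gen_losses) == 0 or gen_losses[-1] > min(gen_losses):
        return False

    path = f"{checkpoint_file}.pt"
    dirname = os.path.dirname(path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    torch.save(gen_model, path)
    return True


# Toy usage with a temporary directory and a stand-in model
with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "gan")
    saved_first = save_checkpoint_sketch(nn.Linear(2, 2), [0.5], [3.0], ckpt)
    saved_worse = save_checkpoint_sketch(nn.Linear(2, 2), [0.5, 0.4], [3.0, 3.5], ckpt)
    file_exists = os.path.isfile(f"{ckpt}.pt")

print(saved_first, saved_worse, file_exists)
```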
In [12]:
import IPython.display
import tqdm
from hw4.gan import train_batch, save_checkpoint

num_epochs = 100

if os.path.isfile(f'{checkpoint_file_final}.pt'):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    num_epochs = 0
    gen = torch.load(f'{checkpoint_file_final}.pt', map_location=device,)
    checkpoint_file = checkpoint_file_final

try:
    dsc_avg_losses, gen_avg_losses = [], []
    for epoch_idx in range(num_epochs):
        # We'll accumulate batch losses and show an average once per epoch.
        dsc_losses, gen_losses = [], []
        print(f'--- EPOCH {epoch_idx+1}/{num_epochs} ---')

        with tqdm.tqdm(total=len(dl_train.batch_sampler), file=sys.stdout) as pbar:
            for batch_idx, (x_data, _) in enumerate(dl_train):
                x_data = x_data.to(device)
                dsc_loss, gen_loss = train_batch(
                    dsc, gen,
                    dsc_loss_fn, gen_loss_fn,
                    dsc_optimizer, gen_optimizer,
                    x_data)
                dsc_losses.append(dsc_loss)
                gen_losses.append(gen_loss)
                pbar.update()

        dsc_avg_losses.append(np.mean(dsc_losses))
        gen_avg_losses.append(np.mean(gen_losses))
        print(f'Discriminator loss: {dsc_avg_losses[-1]}')
        print(f'Generator loss:     {gen_avg_losses[-1]}')
        
        if save_checkpoint(gen, dsc_avg_losses, gen_avg_losses, checkpoint_file):
            print(f'Saved checkpoint.')
            

        samples = gen.sample(5, with_grad=False)
        fig, _ = plot.tensors_as_images(samples.cpu(), figsize=(6,2))
        IPython.display.display(fig)
        plt.close(fig)
except KeyboardInterrupt as e:
    print('\n *** Training interrupted by user')
--- EPOCH 1/100 ---
100%|██████████| 133/133 [00:03<00:00, 41.51it/s]
Discriminator loss: 0.2190957978591883
Generator loss:     7.581579744367671
--- EPOCH 2/100 ---
100%|██████████| 133/133 [00:03<00:00, 42.60it/s]
Discriminator loss: 0.375993627046508
Generator loss:     6.456028620103248
--- EPOCH 3/100 ---
100%|██████████| 133/133 [00:02<00:00, 44.74it/s]
Discriminator loss: 0.5576723007611314
Generator loss:     4.919610567558977
--- EPOCH 4/100 ---
100%|██████████| 133/133 [00:03<00:00, 42.75it/s]
Discriminator loss: 0.5034179173521978
Generator loss:     4.153902232646942
Saved checkpoint.
--- EPOCH 5/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.97it/s]
Discriminator loss: 0.5655487754515239
Generator loss:     3.8003001688118268
--- EPOCH 6/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.74it/s]
Discriminator loss: 0.5436478135547131
Generator loss:     3.691675647308952
Saved checkpoint.
--- EPOCH 7/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.29it/s]
Discriminator loss: 0.5873158931228003
Generator loss:     3.335433780698848
--- EPOCH 8/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.95it/s]
Discriminator loss: 0.5524043802704129
Generator loss:     3.3992904854896375
--- EPOCH 9/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.18it/s]
Discriminator loss: 0.5420843870904213
Generator loss:     3.2608735888524163
Saved checkpoint.
--- EPOCH 10/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.66it/s]
Discriminator loss: 0.5347825932715621
Generator loss:     3.347691211485325
--- EPOCH 11/100 ---
100%|██████████| 133/133 [00:02<00:00, 44.47it/s]
Discriminator loss: 0.4874892494732276
Generator loss:     3.495101801434854
--- EPOCH 12/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.88it/s]
Discriminator loss: 0.5117043137438315
Generator loss:     3.399857797568902
--- EPOCH 13/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.39it/s]
Discriminator loss: 0.4945266580391199
Generator loss:     3.6121438508643244
--- EPOCH 14/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.24it/s]
Discriminator loss: 0.4467761474976638
Generator loss:     3.858880040340854
--- EPOCH 15/100 ---
100%|██████████| 133/133 [00:03<00:00, 41.06it/s]
Discriminator loss: 0.3476050636850129
Generator loss:     3.7566017315800027
Saved checkpoint.
--- EPOCH 16/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.25it/s]
Discriminator loss: 0.42322863705624314
Generator loss:     3.827088889322783
--- EPOCH 17/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.11it/s]
Discriminator loss: 0.4248798497693431
Generator loss:     3.9950358894534577
--- EPOCH 18/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.47it/s]
Discriminator loss: 0.34464207850396633
Generator loss:     4.043871043319989
--- EPOCH 19/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.89it/s]
Discriminator loss: 0.3207206403682555
Generator loss:     4.251064271855175
--- EPOCH 20/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.33it/s]
Discriminator loss: 0.40224112644511506
Generator loss:     4.023276031913614
--- EPOCH 21/100 ---
100%|██████████| 133/133 [00:02<00:00, 44.36it/s]
Discriminator loss: 0.3072650340154655
Generator loss:     4.204044808122449
--- EPOCH 22/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.51it/s]
Discriminator loss: 0.33891547269615013
Generator loss:     4.3637653929846625
--- EPOCH 23/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.92it/s]
Discriminator loss: 0.27845915401340426
Generator loss:     4.331892930475393
Saved checkpoint.
--- EPOCH 24/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.18it/s]
Discriminator loss: 0.20527855306863785
Generator loss:     4.587076257942314
--- EPOCH 25/100 ---
100%|██████████| 133/133 [00:03<00:00, 39.64it/s]
Discriminator loss: 0.2676892869622636
Generator loss:     5.070718639775326
--- EPOCH 26/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.34it/s]
Discriminator loss: 0.3460392777744989
Generator loss:     4.746765232623968
--- EPOCH 27/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.72it/s]
Discriminator loss: 0.22216059471991725
Generator loss:     4.94165923989805
--- EPOCH 28/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.51it/s]
Discriminator loss: 0.40187181521179083
Generator loss:     4.894495808092275
--- EPOCH 29/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.43it/s]
Discriminator loss: 0.2473894180379864
Generator loss:     4.592519339314081
Saved checkpoint.
--- EPOCH 30/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.88it/s]
Discriminator loss: 0.21068276927099192
Generator loss:     4.901278176702055
--- EPOCH 31/100 ---
100%|██████████| 133/133 [00:04<00:00, 31.17it/s]
Discriminator loss: 0.2542778781538171
Generator loss:     5.266972168047626
--- EPOCH 32/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.30it/s]
Discriminator loss: 0.2982353298623759
Generator loss:     5.064995309464018
--- EPOCH 33/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.39it/s]
Discriminator loss: 0.20701568949043303
Generator loss:     5.013064954513894
Saved checkpoint.
--- EPOCH 34/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.77it/s]
Discriminator loss: 0.24608724107755756
Generator loss:     5.330189539973897
--- EPOCH 35/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.43it/s]
Discriminator loss: 0.2006522306010015
Generator loss:     5.055398975099836
Saved checkpoint.
--- EPOCH 36/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.22it/s]
Discriminator loss: 0.16020591050050312
Generator loss:     5.413037953520179
--- EPOCH 37/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.61it/s]
Discriminator loss: 0.26507122318883586
Generator loss:     5.219580284634927
--- EPOCH 38/100 ---
100%|██████████| 133/133 [00:03<00:00, 40.67it/s]
Discriminator loss: 0.2407345438194006
Generator loss:     5.529551411481728
--- EPOCH 39/100 ---
100%|██████████| 133/133 [00:03<00:00, 42.86it/s]
Discriminator loss: 0.22167955882343135
Generator loss:     5.2652883184583565
Saved checkpoint.
--- EPOCH 40/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.23it/s]
Discriminator loss: 0.1654590068567068
Generator loss:     5.793898975042472
--- EPOCH 41/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.94it/s]
Discriminator loss: 0.286294582643007
Generator loss:     5.647861755432043
--- EPOCH 42/100 ---
100%|██████████| 133/133 [00:03<00:00, 42.27it/s]
Discriminator loss: 0.2045082821741812
Generator loss:     5.232755192240378
Saved checkpoint.
--- EPOCH 43/100 ---
100%|██████████| 133/133 [00:02<00:00, 44.38it/s]
Discriminator loss: 0.24725696923477308
Generator loss:     5.421557304554415
--- EPOCH 44/100 ---
100%|██████████| 133/133 [00:03<00:00, 42.23it/s]
Discriminator loss: 0.12218202471284938
Generator loss:     5.667823205316873
--- EPOCH 45/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.08it/s]
Discriminator loss: 0.10447923886708747
Generator loss:     5.7298084425746945
--- EPOCH 46/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.71it/s]
Discriminator loss: 0.20784733884204598
Generator loss:     6.694630064910516
--- EPOCH 47/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.24it/s]
Discriminator loss: 0.2504299006571895
Generator loss:     5.587808469184359
--- EPOCH 48/100 ---
100%|██████████| 133/133 [00:03<00:00, 41.48it/s]
Discriminator loss: 0.1731084656054364
Generator loss:     5.431740377182351
Saved checkpoint.
--- EPOCH 49/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.02it/s]
Discriminator loss: 0.2818260106041019
Generator loss:     6.002270985366707
--- EPOCH 50/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.88it/s]
Discriminator loss: 0.19518965844036942
Generator loss:     5.508424951617879
Saved checkpoint.
--- EPOCH 51/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.52it/s]
Discriminator loss: 0.17600683271324724
Generator loss:     5.476574107220299
Saved checkpoint.
--- EPOCH 52/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.04it/s]
Discriminator loss: 0.19142048839563713
Generator loss:     5.817653193509669
--- EPOCH 53/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.60it/s]
Discriminator loss: 0.07965643715141411
Generator loss:     5.6298859961947105
Saved checkpoint.
--- EPOCH 54/100 ---
100%|██████████| 133/133 [00:02<00:00, 44.38it/s]
Discriminator loss: 0.10593220356263612
Generator loss:     6.3436317730667
--- EPOCH 55/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.88it/s]
Discriminator loss: 0.21936285462146415
Generator loss:     6.302146280618539
--- EPOCH 56/100 ---
100%|██████████| 133/133 [00:03<00:00, 42.98it/s]
Discriminator loss: 0.17182947577614532
Generator loss:     6.191843017599637
Saved checkpoint.
--- EPOCH 57/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.98it/s]
Discriminator loss: 0.18263180324233563
Generator loss:     6.393243289531622
--- EPOCH 58/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.69it/s]
Discriminator loss: 0.126353662124926
Generator loss:     7.208476542530203
--- EPOCH 59/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.61it/s]
Discriminator loss: 0.1522176316834258
Generator loss:     6.507055119464272
--- EPOCH 60/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.05it/s]
Discriminator loss: 0.1952736633164542
Generator loss:     6.265660646266507
--- EPOCH 61/100 ---
100%|██████████| 133/133 [00:03<00:00, 42.65it/s]
Discriminator loss: 0.16389625418679157
Generator loss:     6.38678354786751
--- EPOCH 62/100 ---
100%|██████████| 133/133 [00:02<00:00, 44.73it/s]
Discriminator loss: 0.07430829205795338
Generator loss:     6.422216012065572
--- EPOCH 63/100 ---
100%|██████████| 133/133 [00:02<00:00, 44.64it/s]
Discriminator loss: 0.13146792318587913
Generator loss:     6.912555915072448
--- EPOCH 64/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.16it/s]
Discriminator loss: 0.2114155925810337
Generator loss:     6.676525912786785
--- EPOCH 65/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.17it/s]
Discriminator loss: 0.14953523476545075
Generator loss:     6.72313580387517
--- EPOCH 66/100 ---
100%|██████████| 133/133 [00:03<00:00, 44.16it/s]
Discriminator loss: 0.13229444429726528
Generator loss:     6.521695230240212
Saved checkpoint.
--- EPOCH 67/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.27it/s]
Discriminator loss: 0.11208787313977578
Generator loss:     6.849641455743546
--- EPOCH 68/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.97it/s]
Discriminator loss: 0.14996661894247496
Generator loss:     6.632800337067224
--- EPOCH 69/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.17it/s]
Discriminator loss: 0.1131709978208506
Generator loss:     6.443287986561768
Saved checkpoint.
--- EPOCH 70/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.79it/s]
Discriminator loss: 0.08913347511587286
Generator loss:     6.2660031130439355
Saved checkpoint.
--- EPOCH 71/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.47it/s]
Discriminator loss: 0.08015771848814827
Generator loss:     7.241267472281492
--- EPOCH 72/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.73it/s]
Discriminator loss: 0.0912784420345959
Generator loss:     7.051414177830058
--- EPOCH 73/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.06it/s]
Discriminator loss: 0.10338065411923524
Generator loss:     7.088659754373078
--- EPOCH 74/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.17it/s]
Discriminator loss: 0.09007283709103003
Generator loss:     7.23390133219554
--- EPOCH 75/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.18it/s]
Discriminator loss: 0.08460077601379919
Generator loss:     7.3073597976139615
--- EPOCH 76/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.80it/s]
Discriminator loss: 0.20449919858597276
Generator loss:     7.234231433474031
--- EPOCH 77/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.86it/s]
Discriminator loss: 0.1310041579312848
Generator loss:     7.178147875276723
Saved checkpoint.
--- EPOCH 78/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.25it/s]
Discriminator loss: 0.18298278242013508
Generator loss:     7.083550356384507
--- EPOCH 79/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.08it/s]
Discriminator loss: 0.10438710157024234
Generator loss:     6.663891812016193
Saved checkpoint.
--- EPOCH 80/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.90it/s]
Discriminator loss: 0.10439451145274299
Generator loss:     7.028215352753947
--- EPOCH 81/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.46it/s]
Discriminator loss: 0.15957645609750784
Generator loss:     7.599771001733336
--- EPOCH 82/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.82it/s]
Discriminator loss: 0.06825523168072664
Generator loss:     6.909244671800082
Saved checkpoint.
--- EPOCH 83/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.81it/s]
Discriminator loss: 0.13897242895642617
Generator loss:     7.241490165990098
--- EPOCH 84/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.10it/s]
Discriminator loss: 0.20109684482440912
Generator loss:     6.622801887361627
--- EPOCH 85/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.49it/s]
Discriminator loss: 0.1671155015972996
Generator loss:     7.05334020288367
--- EPOCH 86/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.45it/s]
Discriminator loss: 0.10676893160531395
Generator loss:     7.1421724226241725
--- EPOCH 87/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.98it/s]
Discriminator loss: 0.12870663656552034
Generator loss:     7.24330983305336
--- EPOCH 88/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.26it/s]
Discriminator loss: 0.04220255173014519
Generator loss:     7.437301996058988
--- EPOCH 89/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.30it/s]
Discriminator loss: 0.12061875534797073
Generator loss:     7.686244801471108
--- EPOCH 90/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.10it/s]
Discriminator loss: 0.24989488953374384
Generator loss:     7.36822489688271
--- EPOCH 91/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.43it/s]
Discriminator loss: 0.10757883038735927
Generator loss:     7.155592380609727
Saved checkpoint.
--- EPOCH 92/100 ---
100%|██████████| 133/133 [00:03<00:00, 43.48it/s]
Discriminator loss: 0.06716966836300112
Generator loss:     6.801169855254037
Saved checkpoint.
--- EPOCH 93/100 ---
100%|██████████| 133/133 [00:03<00:00, 42.84it/s]
Discriminator loss: 0.07060629375895164
Generator loss:     8.000765320053674
--- EPOCH 94/100 ---
100%|██████████| 133/133 [00:03<00:00, 34.51it/s]
Discriminator loss: 0.09327887127498038
Generator loss:     7.613252650526233
--- EPOCH 95/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.98it/s]
Discriminator loss: 0.23082287295868523
Generator loss:     6.970089476359518
--- EPOCH 96/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.50it/s]
Discriminator loss: 0.14272223642670123
Generator loss:     7.5348126691086845
--- EPOCH 97/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.04it/s]
Discriminator loss: 0.18904631808960348
Generator loss:     7.210255502982247
--- EPOCH 98/100 ---
100%|██████████| 133/133 [00:04<00:00, 32.93it/s]
Discriminator loss: 0.12453283137060646
Generator loss:     6.5892723879419774
Saved checkpoint.
--- EPOCH 99/100 ---
100%|██████████| 133/133 [00:04<00:00, 33.13it/s]
Discriminator loss: 0.04283176160844645
Generator loss:     7.610698440021142
--- EPOCH 100/100 ---
100%|██████████| 133/133 [00:03<00:00, 33.31it/s]
Discriminator loss: 0.045409283691779115
Generator loss:     7.678944447883089
In [13]:
# Plot images from best or last model
if os.path.isfile(f'{checkpoint_file}.pt'):
    gen = torch.load(f'{checkpoint_file}.pt', map_location=device)
print('*** Images Generated from best model:')
samples = gen.sample(n=15, with_grad=False).cpu()
fig, _ = plot.tensors_as_images(samples, nrows=3, figsize=(6,6))
*** Images Generated from best model:

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

In [14]:
from cs236781.answers import display_answer
import hw4.answers as answers

Question 1¶

Explain in detail why during training we sometimes need to maintain gradients when sampling from the GAN, and other times we don't. When are they maintained and why? When are they discarded and why?

In [18]:
display_answer(answers.part3_q1)

3. In every batch iteration the generator is used twice. When we train the generator, we sample with gradients enabled, because the generator loss has to be backpropagated through the generated images to the parameters of Ψγ. When we train the discriminator, the generator is frozen and is only used to create images for the discriminator to be trained on. In this step only the discriminator is updated, so the gradients with respect to the generator are irrelevant and are discarded (the generated images are treated as constants in the discriminator loss), which also saves memory and computation.
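
The two sampling modes can be illustrated with a minimal sketch. The tiny `gen` and `dsc` linear layers below are hypothetical stand-ins, not the hw4 models, and a real discriminator step would also include a batch of real images.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the generator and discriminator (not the hw4 models).
z_dim, x_dim = 4, 8
gen = nn.Linear(z_dim, x_dim)
dsc = nn.Linear(x_dim, 1)
z = torch.randn(16, z_dim)

# Discriminator step: sample without gradients, so the fakes are constants.
with torch.no_grad():
    fake_for_dsc = gen(z)
dsc_loss = nn.functional.binary_cross_entropy_with_logits(
    dsc(fake_for_dsc), torch.zeros(16, 1)
)
dsc_loss.backward()
gen_grad_after_dsc_step = gen.weight.grad  # None: nothing reached the generator

# Generator step: sample with gradients so the loss reaches the generator weights.
dsc.zero_grad()
fake_for_gen = gen(z)
gen_loss = nn.functional.binary_cross_entropy_with_logits(
    dsc(fake_for_gen), torch.ones(16, 1)
)
gen_loss.backward()
gen_grad_after_gen_step = gen.weight.grad  # now a tensor with the generator gradient
```

In this sketch the generator weights receive a gradient only in the second step, which matches the distinction described in the answer.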

Question 2¶

  1. When training a GAN to generate images, should we decide to stop training solely based on the fact that the Generator loss is below some threshold? Why or why not?

  2. What does it mean if the discriminator loss remains at a constant value while the generator loss decreases?

In [19]:
display_answer(answers.part3_q2)

2.1. We should not stop the training based on the generator loss alone. The generator and the discriminator are trained against each other, so each loss is only meaningful relative to the current state of the other model. A low generator loss means that the generator successfully fools the discriminator, but if the discriminator is not that good yet, the generated images may still be poor. Likewise, the generator loss can decrease while the discriminator loss increases even though both models have improved, simply because one improved more than the other. Thus the generator loss by itself is not an effective criterion for deciding when to stop.

2.2. The discriminator tries to distinguish between real and generated images, so a decrease in the generator loss means that the generator is getting better at tricking it, i.e. the discriminator more often thinks that generated images are real. The discriminator loss is calculated from how well it identifies both the real and the “fake” images. If the total $$\mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) ))$$ is not really changing while the generator loss decreases, then its two terms are changing in opposite directions and cancel out: the discriminator is getting worse on generated images and better on real images by a similar amount.

Question 3¶

Compare the results you got when generating images with the VAE to the GAN results. What's the main difference and what's causing it?

In [20]:
display_answer(answers.part3_q3)

3. The Generative Adversarial Network produced much sharper images, with a large variation in the expressions, backgrounds, colors and viewing angles of Bush. The Variational Autoencoder produced less sharp images: they are smooth and smudged, they look very similar to each other, and they lack the fine details we could see in the GAN results, for example the background, facial expressions, clothing, the placement of Bush and the angle the image was taken from.

We believe that the main difference between the results comes from the different objectives of the two models. The GAN generator's target is to trick the discriminator into thinking that a generated image is a real one. As a result, the generated images aim to look like the real images from the dataset, including many details, different angles and expressions. The VAE, in contrast, aims to create images that best fit the probability distribution of the dataset. This leads to results that look like an average picture, produced by a generator that tries to avoid any noticeable difference from the real data. We can easily tell the results of the two models apart by measuring the smoothness and sharpness of the images.


Part 4: Summary Questions¶

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

Notes

  • Clearly mark where your answer begins, e.g. write "Answer:" in the beginning of your cell.
  • Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
  • This notebook should be runnable from start to end without any errors.

CNNs¶

  1. Explain the meaning of the term "receptive field" in the context of CNNs.

Answer: In the context of CNNs, the receptive field of a feature is the region of the input that affects that feature's value, i.e. the portion of the input space whose units provide input to it through the preceding layers. For example, if the input is an image, the receptive field of a feature is the set of pixels that affect the feature's calculation.

  1. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

Answer:

There are many different ways to control the rate at which the receptive field grows from layer to layer. Here are three of them:

a. The dilation- this parameter affects the distance between pixels in the kernel. If we were to increase this parameter we are giving the filters the ability to calculate patterns that are more spread out without increasing the kernel size.

b. The kernel size- increasing the kernel size increases the area that each feature is affected by, which also leads to growth in the receptive field. Changing the kernel size, up or down, means that more or fewer adjacent pixels are combined in each filter calculation.

c. The stride size- the stride is the distance between two adjacent filter activations. Increasing the stride does not change the receptive field of the current layer's features, but it makes the receptive field grow faster in the following layers, because relatively distant regions of the input are combined together in future layers.

  1. Imagine a CNN with three convolutional layers, defined as follows:
In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape
Out[1]:
torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

Answer:

The size of the receptive field after N convolutional layers, where $s_i$ is the stride and $k_i$ is the kernel size of layer $i$, equals: $1 + \sum_{i=1}^{N} \left((k_i-1)\cdot \prod_{j=1}^{i-1} s_j\right)$. To use this formula we also need to take care of the ReLU activation function, the dilation parameter and the pooling layers. ReLU- we can ignore these layers since they don’t affect the receptive field’s size. Dilation- we can define a new kernel size, which is the effective extent of the dilated kernel; its dimension is 1+2*(7-1) = 13 since adjacent kernel elements are 2 pixels apart. Pooling- a pooling layer with a kernel of size 2 is effectively a convolutional layer with both kernel size and stride equal to 2. Now we can apply the formula above to all of the model’s layers:

$$1+(3-1)\cdot 1+(2-1)\cdot 1+(5-1)\cdot 2\cdot 1+(2-1)\cdot 2\cdot 2\cdot 1+(13-1)\cdot 2\cdot 2\cdot 2\cdot 1 = 1+2+1+8+4+96 = 112$$

That means that every pixel in the output tensor has a receptive field of 112×112 input pixels.
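
The calculation above can be checked with a short sketch. The helper `receptive_field` and the `(kernel_size, stride, dilation)` tuples are illustrative names introduced here, and pooling is treated as a convolution with kernel size 2 and stride 2, as assumed in the answer.

```python
# (kernel_size, stride, dilation) for every layer that changes the receptive
# field, in network order. ReLU layers are skipped since they act per element.
layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]


def receptive_field(layers):
    rf, jump = 1, 1  # jump = product of the strides of all previous layers
    for kernel, stride, dilation in layers:
        effective_kernel = 1 + dilation * (kernel - 1)
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


rf = receptive_field(layers)
print(rf)  # 112
```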

  1. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

    After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

    However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

Answer:

With residual connections we can optimize the model better, because the skip connections carry gradients that would otherwise have vanished. Moreover, as a result of the different structure of the model, each layer learns a different function than before: instead of the full mapping it only has to learn the residual, i.e. the original function minus the input, $f_l(\vec{x};\vec{\theta}_l)=\vec{y}_l-\vec{x}$. Since the layers are fitted to different target functions, the learned filters are different.

Dropout¶

  1. True or false: dropout must be placed only after the activation function.

Answer:

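False.

With a ReLU activation, placing the dropout layer before or after the activation gives identical output: dropout multiplies each activation either by 0 or by the positive constant $1/(1-p)$, and ReLU satisfies $f(0)=0$ and $f(cx)=c\,f(x)$ for $c>0$, so the two operations commute.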
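
As a quick sanity check of this claim, here is a small pure-Python sketch. It applies one shared dropout mask in both orders; the mask construction and the value `p = 0.5` are assumptions made for the illustration, not the course implementation.

```python
import random

random.seed(0)
p = 0.5  # drop probability (arbitrary choice for this sketch)


def relu(v):
    return max(v, 0.0)


xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
# One shared inverted-dropout mask: 0 with probability p, else 1 / (1 - p).
mask = [0.0 if random.random() < p else 1.0 / (1.0 - p) for _ in xs]

dropout_then_relu = [relu(x * m) for x, m in zip(xs, mask)]
relu_then_dropout = [relu(x) * m for x, m in zip(xs, mask)]

same = all(a == b for a, b in zip(dropout_then_relu, relu_then_dropout))
print(same)
```

The equality relies on the mask values being non-negative; an activation without these properties (for example the sigmoid, where $f(0)\neq 0$) would not commute with dropout in this way.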

  1. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

Answer:

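Let $x$ be an activation with expectation $E(x)$. Dropout keeps it with probability $1-p$ and zeroes it with probability $p$, independently of its value, so without scaling the expectation of the output is $(1-p)\cdot E(x)+p\cdot 0=(1-p)\cdot E(x)$. Multiplying the kept activations by a constant $c$ gives $c\cdot(1-p)\cdot E(x)$, which equals the original $E(x)$ exactly when $c=1/(1-p)$.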
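
The expectation argument can also be checked empirically. The following is a Monte Carlo sketch with arbitrary values for `p`, `x` and the number of draws `n`; it is an illustration, not a proof.

```python
import random

random.seed(0)
p = 0.3      # drop probability (arbitrary)
x = 2.0      # a fixed activation value
n = 200_000  # number of independent dropout draws


def inverted_dropout(value, p):
    # Keep with probability 1 - p and rescale by 1 / (1 - p); otherwise zero.
    return value / (1.0 - p) if random.random() >= p else 0.0


# Without scaling the mean shrinks to about (1 - p) * x.
unscaled_mean = sum(x if random.random() >= p else 0.0 for _ in range(n)) / n
# With scaling the mean stays close to x.
scaled_mean = sum(inverted_dropout(x, p) for _ in range(n)) / n

print(f"unscaled: {unscaled_mean:.3f}, scaled: {scaled_mean:.3f}")
```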

Losses and Activation functions¶

  1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? if so, why? if not, demonstrate with a numerical example. What would you use instead?

Answer:

Using the L2 loss is a bad idea in this case, because a better classification can get a larger loss than a worse one. For example, consider two predictions for a pair of dog images (target 0 for both), where a score below 0.5 is classified as a dog:

first prediction: score1 = 0.4, score2 = 1 -----> L2_error = 0.5(0.4^2 + 1^2) = 0.58, one correct classification

second prediction: score1 = 0.6, score2 = 0.6 ------> L2_error = 0.5(0.6^2 + 0.6^2) = 0.36, no correct classification

The first prediction gets a larger loss than the second, although the first classifies one image correctly and the second classifies none of them correctly.

A better loss function for this problem is the binary cross-entropy, which is suited to binary classification as we saw in the course.
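
The numbers in the example can be reproduced with a few lines of Python. The helper names and the 0.5 decision threshold are assumptions made for this sketch.

```python
def half_sum_squared_error(scores, targets):
    # L2 error in the form used in the example: 0.5 * sum of squared differences.
    return 0.5 * sum((s - t) ** 2 for s, t in zip(scores, targets))


def num_correct(scores, targets, threshold=0.5):
    # A score at or above the threshold is classified as a hotdog (1), else as a dog (0).
    return sum(int(s >= threshold) == t for s, t in zip(scores, targets))


targets = [0, 0]            # both images are dogs
first_scores = [0.4, 1.0]   # one correct, one confidently wrong
second_scores = [0.6, 0.6]  # both wrong, but only slightly

first_loss = half_sum_squared_error(first_scores, targets)    # about 0.58
second_loss = half_sum_squared_error(second_scores, targets)  # about 0.36
first_correct = num_correct(first_scores, targets)
second_correct = num_correct(second_scores, targets)

print(first_loss, first_correct)
print(second_loss, second_correct)
```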

  1. After months of research into the origins of climate change, you observe the following result:

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in N locations around the globe. You define your model as follows:

In [2]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H),
        nn.Sigmoid(),
    ]*N,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations. It seems that your model is no longer training. What is the most likely cause?

Answer:

The most likely cause is vanishing gradients. The model presented in the question stacks N = 42 hidden linear layers, each followed by a sigmoid activation; this is a deep network, and since the derivative of the sigmoid is at most 0.25, the gradient shrinks as it is backpropagated through the layers. We also noticed that the number-of-pirates axis contains large values, which push the activation function into its flat regions, where the gradients are small.
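
The effect can be observed in a small experiment. The sketch below builds a similar sigmoid MLP with an arbitrary depth of 20 and random inputs (both are assumptions made for the illustration), using distinct layer instances, and compares the gradient norm of the first and last linear layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, H, depth = 42, 128, 20  # depth is an arbitrary choice for this sketch

layers = [nn.Linear(N, H), nn.Sigmoid()]
for _ in range(depth):
    layers += [nn.Linear(H, H), nn.Sigmoid()]
layers.append(nn.Linear(H, 1))
model = nn.Sequential(*layers).double()  # float64 to avoid underflow to zero

x = torch.randn(8, N, dtype=torch.float64)
loss = model(x).pow(2).mean()
loss.backward()

first_grad_norm = model[0].weight.grad.norm().item()
last_grad_norm = model[-1].weight.grad.norm().item()
print(f"first layer: {first_grad_norm:.3e}, last layer: {last_grad_norm:.3e}")
```

In this setup the gradient that reaches the first layer is many orders of magnitude smaller than the one at the last layer, so the early layers barely move during training.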

  1. Referring to question 2 above: A friend suggests that if you replace the sigmoid activations with tanh, it will solve your problem. Is he correct? Explain why or why not.

Answer:

Our friend's advice is not very helpful. The tanh activation function is just a rescaled and shifted sigmoid, $\tanh(x)=2\sigma(2x)-1$, so it also saturates and is prone to vanishing gradients for large inputs, as explained in question 2. Although the gradient of tanh is larger (up to 1, compared with at most 0.25 for the sigmoid), so that the effect of vanishing gradients is delayed, this "improvement" is likely to be negligible considering the large scale of the number-of-pirates axis and the depth of the network.
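
The relation between the two activations and their derivatives can be checked numerically. This is a small sketch using only the standard library; the sample points are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


points = [-10.0, -2.0, 0.0, 2.0, 10.0]

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
identity_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in points)

# Both derivatives peak at 0 and become tiny for large |x|.
for x in points:
    print(f"x={x:6.1f}  sigmoid'={d_sigmoid(x):.2e}  tanh'={d_tanh(x):.2e}")
```

Both derivatives are largest at zero and decay quickly as the input grows, which is why swapping one activation for the other does not remove the saturation problem.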

  1. Regarding the ReLU activation, state whether the following sentences are true or false and explain:
    1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
    2. The gradient of ReLU is linear with its input when the input is positive.
    3. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

Answer:

A. False. Although ReLU itself does not normally cause vanishing gradients the way sigmoid and tanh do (because its gradient is 1 for positive values and 0 for non-positive values), there are more factors to consider: the network depth or other layers in the network may still cause vanishing gradients.

B. False. The gradient of the ReLU function for positive values is the constant 1, so it does not depend on the input. A constant could technically be regarded as a degenerate linear function, but it does not have the properties of linearity we usually mean, so we decided to go with false.

C. True. During a forward pass, a neuron whose pre-activation is negative is zeroed by the activation layer (by the definition of the ReLU function). If this happens for all inputs, the neuron is useless. In that case the gradients for this neuron are also 0, so the weights feeding it are not updated, and it stays "dead" for the rest of the training process as well.
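
Statements B and C can be illustrated with autograd. The weights, bias and inputs below are made-up values chosen so that every pre-activation of the second example is negative.

```python
import torch

# Gradient of ReLU: 0 for negative inputs, constant 1 for positive inputs.
x = torch.tensor([-2.0, -1.0, 1.0, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
relu_grad = x.grad.tolist()
print(relu_grad)  # [0.0, 0.0, 1.0, 1.0]

# A "dead" neuron: the bias is so negative that every pre-activation is below
# zero, so the output and all parameter gradients are exactly zero.
w = torch.tensor([0.5, -0.3], requires_grad=True)
b = torch.tensor(-10.0, requires_grad=True)
inputs = torch.tensor([[1.0, 2.0], [0.5, -1.0], [-2.0, 3.0]])
out = torch.relu(inputs @ w + b)
out.sum().backward()
print(out.tolist(), w.grad.tolist(), b.grad.item())
```

Since the gradients of `w` and `b` are zero for every input, a gradient-based optimizer has no signal to move this neuron out of the dead region.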

Optimization¶

  1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

Answer:

These three descent algorithms all update the parameters in the direction of the steepest decrease of the loss; they differ in how the loss is calculated.

SGD: The loss is calculated using one, randomly chosen data sample.

GD: The loss is calculated using the average loss of all the data samples.

Mini-Batch SGD: A combination of the two. The loss is calculated as the average over a randomly chosen batch of data samples.
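
The difference between the three variants comes down to which samples are averaged when computing the gradient. The sketch below uses a toy one-parameter regression problem; the data, the helper names and the batch size of 16 are assumptions made for the illustration.

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]
ys = [3.0 * x for x in xs]  # toy data from y = 3x
w = 0.0                     # current parameter value


def sample_grad(w, x, y):
    # Gradient of 0.5 * (w * x - y) ** 2 with respect to w.
    return (w * x - y) * x


def mean_grad(w, indices):
    return sum(sample_grad(w, xs[i], ys[i]) for i in indices) / len(indices)


all_indices = range(len(xs))
gd_grad = mean_grad(w, all_indices)                        # GD: every sample
sgd_grad = mean_grad(w, [random.randrange(len(xs))])       # SGD: one random sample
batch_grad = mean_grad(w, random.sample(all_indices, 16))  # mini-batch: 16 random samples

print(gd_grad, sgd_grad, batch_grad)
```

All three estimates point in the same general direction here, but the single-sample and mini-batch values change from draw to draw while the full-data gradient is fixed.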

  1. Regarding SGD and GD:
    1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
    2. In what cases can GD not be used at all?

Answer:

A. Two reasons why SGD is used more often:

1. As we explained before, SGD uses one data sample per update while GD uses all the data, therefore each SGD update is faster and requires less memory.

2. SGD has the ability to escape from local minima, because the loss function being minimized changes according to the randomly chosen sample. Together with its cheaper updates, this usually makes SGD converge faster than GD.

B. When the dataset is large, it may be impossible to store all the data together with the activations and gradients needed for backpropagation in memory, making regular GD impossible.

  1. You have trained a deep resnet to obtain SoTA results on ImageNet. While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average. Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM. You're now considering to increase the mini-batch size from $B$ to $2B$. Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? explain in detail.

Answer:

We expect the number of iterations to decrease. Each update is computed from a larger batch, so the gradient estimate is less noisy and each step is more precise. Therefore we expect fewer iterations to be needed to reach $l_0$. Note that fewer iterations does not necessarily mean less time, because each batch is larger, so each iteration requires more computation and memory.
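
The noise reduction can be illustrated by measuring the variance of the mini-batch gradient estimate for batch sizes $B$ and $2B$. The per-sample "gradients" below are synthetic random numbers, and the values of `B` and `trials` are arbitrary, so this is only a sketch of the statistical effect and not a training experiment.

```python
import random
import statistics

random.seed(0)
# Hypothetical per-sample gradients (one scalar parameter, 10,000 samples).
per_sample_grads = [random.gauss(1.0, 2.0) for _ in range(10_000)]


def batch_grad_variance(batch_size, trials=2000):
    # Variance of the mini-batch gradient estimate across many random batches.
    estimates = [
        statistics.fmean(random.sample(per_sample_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.variance(estimates)


B = 32
var_b = batch_grad_variance(B)
var_2b = batch_grad_variance(2 * B)
print(f"variance with B: {var_b:.4f}, with 2B: {var_2b:.4f}")
```

Doubling the batch size roughly halves the variance of the estimate, which is the sense in which each update becomes more precise.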

  1. For each of the following statements, state whether they're true or false and explain why.
    1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
    2. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
    3. SGD is less likely to get stuck in local minima, compared to GD.
    4. Training with SGD requires more memory than with GD.
    5. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
    6. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

Answer:

A. False. We take the samples randomly from the data, so we don't have to perform a step for every sample.

B. False. SGD has a larger variance than GD: GD always computes the gradient from all of the data, so it always uses the same loss function, while SGD uses a single random sample to calculate the gradient.

C. True. As we explained before, regular GD always uses the same loss function, so it can get stuck in a local minimum, while the randomness and the larger variance of the gradients calculated by SGD may help it escape from a local minimum.

D. False. SGD uses only one instance while GD requires more memory since it calculates all loss gradients and sums them up.

E. False. As we explained before, GD may get stuck in a local minimum since it has no randomized factor to help it escape.

F. False. Using momentum we may take larger steps so we may overstep the narrow ravine unlike Newton's method where we would take smaller steps and probably not miss the lowest point in the narrow ravine.

  1. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
    1. True or false: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc). Provide a mathematical justification for your answer.

Answer:

False. For a polynomial function we can find the minimum/maximum in closed form by setting the derivative to zero, so we don't need gradient-based methods.

  1. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$. Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
    1. Explain the concepts of "vanishing gradients", and "exploding gradients".
    2. How can each of these problems be caused by increased depth?
    3. Provide a numerical example demonstrating each.
    4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

Answer:

A. Exploding gradients are gradients that become too large to convey meaningful information for training the network. Vanishing gradients are gradients that start out at the beginning of backpropagation and become negligible before they can propagate to the later stages.

B. In deeper networks, backpropagation chains together the gradients of all the layers. If many layers have large (or small) gradients, this chain effect can cause the total value to grow (or shrink) exponentially with depth.

C. Suppose we have a network with $K$ layers, where each layer is a $1\times 1$ matrix $w_i$. Using the L2 loss, for a sample $x$ with label $y = 0$ we get a loss value of $l_2=(w_1w_2\cdots w_{K-1}w_Kx)^2$. In backpropagation, we calculate the derivative with respect to the last layer: $\frac{\partial l_2}{\partial w_K} = 2(w_1w_2\cdots w_{K-1}w_Kx)\cdot w_1w_2\cdots w_{K-1}x$. Assuming all the weights are equal to some value $V$, the total value is $2V^{2K-1}x^2$. For a deep enough network (large $K$), choosing $V>1$ gives an exploding gradient and choosing $0<V<1$ gives a vanishing gradient.
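To make the example concrete, the following sketch simply evaluates the expression $2V^{2K-1}x^2$ for a few values. The helper name `last_layer_grad` and the chosen values of $V$, $K$ and $x$ are illustrative assumptions.

```python
def last_layer_grad(v, k, x=1.0):
    """Gradient of (w_1 ... w_K x)^2 w.r.t. w_K when every weight equals v."""
    return 2 * v ** (2 * k - 1) * x ** 2

# Compare a shallow and a deep network for V < 1 and V > 1.
for v in (0.5, 1.5):
    for k in (5, 50):
        print(f"V={v}, K={k}: gradient = {last_layer_grad(v, k):.3e}")
```

With $V=0.5$ the gradient shrinks towards zero as $K$ grows, while with $V=1.5$ it grows by many orders of magnitude.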

D. If the weights are relatively stable, we can assume we have vanishing gradients, because the gradients are not large enough to change the current weights. If, however, the loss is unstable with no trend of improvement and the weights change dramatically, we can expect exploding gradients, which are too large to “carry” relevant information.

Backpropagation¶

  1. You wish to train the following 2-layer MLP for a binary classification task: $$ \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2 $$ You wish to minimize the in-sample loss function, defined as $$ L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right) $$ where the pointwise loss is binary cross-entropy: $$ \ell(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $$

    Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

Answer: We will start with computing the derivatives of $l$ and $\hat{y}$

$\frac{\partial \hat{y}}{\partial x} = W_2*\varphi'*W_1$

$\frac{\partial \hat{y}}{\partial b_2} = 1$

$\frac{\partial \hat{y}}{\partial b_1} = W_2*\varphi'$

$\frac{\partial \hat{y}}{\partial W_2} = [\varphi(W_1*x+b_1)]^t$

$\frac{\partial \hat{y}}{\partial W_1} = W_2*\varphi'*x^t$

$\frac{\partial l}{\partial \hat{y}} = \frac{\hat{y}-y}{\hat{y}*(1-\hat{y})}$

Now using the chain rule, $ \frac{\partial l}{\partial \hat{y}} *\frac{\partial \hat{y}}{\partial k} = \frac{\partial l}{\partial k}$, we are able to calculate that:

$\frac{\partial L}{\partial W_1} = \frac{1}{N} * \sum_{n=1}^{N}\left(\frac{\hat{y}-y}{\hat{y}*(1-\hat{y})} *W_2*\varphi'*x^t\right)+\lambda*W_1$

$\frac{\partial L}{\partial W_2} = \frac{1}{N} * \sum_{n=1}^{N}\left(\frac{\hat{y}-y}{\hat{y}*(1-\hat{y})} *[\varphi(W_1*x+b_1)]^t\right)+\lambda*W_2$

$\frac{\partial L}{\partial b_1} = \frac{1}{N} * \sum_{n=1}^{N}\left(\frac{\hat{y}-y}{\hat{y}*(1-\hat{y})} *W_2*\varphi'\right)$

$\frac{\partial L}{\partial b_2} = \frac{1}{N} * \sum_{n=1}^{N}\left(\frac{\hat{y}-y}{\hat{y}*(1-\hat{y})}\right)$

$\frac{\partial L}{\partial x} = \frac{1}{N} * \sum_{n=1}^{N}\left(\frac{\hat{y}-y}{\hat{y}*(1-\hat{y})} *W_2*\varphi'*W_1\right)$
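As a quick sanity check of the pointwise derivative $\frac{\partial l}{\partial \hat{y}}$ used above, the following sketch compares it against autograd for a single scalar. The values $y=1$ and $\hat{y}=0.3$ are arbitrary choices for illustration.

```python
import torch

y = torch.tensor(1.0, dtype=torch.float64)
y_hat = torch.tensor(0.3, dtype=torch.float64, requires_grad=True)

# Binary cross-entropy for one sample, as defined in the question.
loss = -y * torch.log(y_hat) - (1 - y) * torch.log(1 - y_hat)
loss.backward()

# Analytic derivative: (y_hat - y) / (y_hat * (1 - y_hat)).
analytic = ((y_hat - y) / (y_hat * (1 - y_hat))).detach()
print(y_hat.grad.item(), analytic.item())
```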

  1. Given the following code snippet, implement the custom backward function part4_affine_backward in hw4/answers.py so that it passes the asserts.
In [3]:
from torch.autograd import Function

from hw4.answers import part4_affine_backward

N, d_in, d_out = 100, 11, 7
dtype = torch.float64
X = torch.rand(N, d_in, dtype=dtype)
W = torch.rand(d_out, d_in, requires_grad=True, dtype=dtype)
b = torch.rand(d_out, requires_grad=True, dtype=dtype)

def affine(X, W, b):
    return 0.5 * X @ W.T + b

class AffineLayerFunction(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        result = affine(X, W, b)
        ctx.save_for_backward(X, W, b)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        return part4_affine_backward(ctx, grad_output)

l1 = torch.sum(AffineLayerFunction.apply(X, W, b))
print(l1.backward())
W_grad1 = W.grad
b_grad1 = b.grad

l2 = torch.sum(affine(X, W, b))
W.grad = b.grad = None
l2.backward()
W_grad2 = W.grad
b_grad2 = b.grad

assert torch.allclose(W_grad1, W_grad2)
assert torch.allclose(b_grad1, b_grad2)
None
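The submitted part4_affine_backward lives in hw4/answers.py and is not shown here. The following is one possible sketch of such a backward function for `0.5 * X @ W.T + b`, with the hypothetical names `affine_backward_sketch` and `AffineSketch` and smaller toy shapes; it is not necessarily the submitted solution.

```python
import torch
from torch.autograd import Function

def affine_backward_sketch(ctx, grad_output):
    """Hypothetical backward for y = 0.5 * X @ W.T + b."""
    X, W, b = ctx.saved_tensors
    grad_X = 0.5 * grad_output @ W        # (N, d_out) @ (d_out, d_in) -> (N, d_in)
    grad_W = 0.5 * grad_output.T @ X      # (d_out, N) @ (N, d_in) -> (d_out, d_in)
    grad_b = grad_output.sum(dim=0)       # sum over the batch -> (d_out,)
    return grad_X, grad_W, grad_b

class AffineSketch(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        ctx.save_for_backward(X, W, b)
        return 0.5 * X @ W.T + b

    @staticmethod
    def backward(ctx, grad_output):
        return affine_backward_sketch(ctx, grad_output)

torch.manual_seed(0)
X = torch.rand(10, 4, dtype=torch.float64, requires_grad=True)
W = torch.rand(3, 4, dtype=torch.float64, requires_grad=True)
b = torch.rand(3, dtype=torch.float64, requires_grad=True)

# Gradients from the custom backward vs. gradients from plain autograd.
custom = torch.autograd.grad(AffineSketch.apply(X, W, b).sum(), (X, W, b))
reference = torch.autograd.grad((0.5 * X @ W.T + b).sum(), (X, W, b))
print(all(torch.allclose(c, r) for c, r in zip(custom, reference)))
```

The factor 0.5 from the forward pass appears in the gradients for `X` and `W` but not for `b`, since `b` is added after the scaling.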

Sequence models¶

  1. Regarding word embeddings:
    1. Explain this term and why it's used in the context of a language model.
    2. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

Answer:

A. A word embedding is a learned representation for text in which words with similar meanings have similar representations. This is achieved by encoding the words of a language as numerical vectors, such that semantically closer words have closer encodings. This approach not only represents the words in a compact form but also preserves their semantic meaning. When a model's input is language, preserving semantic meaning is very important: very different words may have similar meanings, so we want their embeddings to be close too.

B. Training a model without a word embedding would be very challenging, because emotion can be conveyed by very specific words. Without an embedding we lose the notion of how each word relates to other words, which is what lets us place words on a spectrum of emotions. It would then be hard, if not impossible, to create a network that reacts appropriately to specific words without being able to compare a word's encoded value to those of other words.

  1. Considering the following snippet, explain:
    1. What does Y contain? why this output shape?
    2. How you would implement nn.Embedding yourself using only torch tensors.
In [4]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")
Y.shape=torch.Size([5, 6, 7, 8, 42000])

Answer:

A. Y contains the embedding vectors of the tokens in X. X has shape (5, 6, 7, 8) and contains integer values in the range [0, 41]. The embedding dimension is 42000, which is the size of each embedding vector, so every value in X is replaced by a vector of size 42000. This is why Y has the shape of X with an extra trailing dimension of 42000.

B. We can implement nn.Embedding using a matrix of size vocabulary-size × embedding-dimension. Every word is represented as a number from zero to the number of words in the dictionary minus 1, and its embedded value is the corresponding row of the matrix.
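The row-lookup idea above can be sketched with plain tensor indexing. The helper name `manual_embedding` is hypothetical, and a small embedding dimension of 16 is assumed instead of 42000 to keep the example light.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_embeddings, embedding_dim = 42, 16   # 16 is an assumed, smaller dimension

# Learnable lookup table: one row per token index.
weight = torch.randn(num_embeddings, embedding_dim, requires_grad=True)

def manual_embedding(indices, weight):
    """Replace every integer index with the matching row of `weight`."""
    return weight[indices]

X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7, 8))
Y = manual_embedding(X, weight)

# Cross-check against nn.Embedding initialised with the same table.
reference = nn.Embedding.from_pretrained(weight.detach(), freeze=True)
print(Y.shape, torch.equal(Y, reference(X)))
```

Because the lookup is ordinary indexing, gradients flow back only into the rows of `weight` that were actually selected.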

  1. Regarding truncated backpropagation through time (TBPTT) with a sequence length of $S$: State whether the following sentences are true or false, and explain.
    1. TBPTT uses a modified version of the backpropagation algorithm.
    2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length $S$.
    3. TBPTT allows the model to learn relations between input that are at most $S$ timesteps apart.

Answer:

a. True. The computation at each step is the same as in regular backpropagation; the modification is that TBPTT backpropagates for only a constant number of steps.

b. False. It is not enough to limit the length of the sequence provided to the model. We also have to set the truncation length, i.e. how many timesteps to go back through when backpropagating.

c. False. In short: the hidden state. In more detail, the hidden state still carries information across sequences, so the processing of each input is indirectly affected by all of the previous inputs that have passed through the state.
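Points b and c can be seen in a minimal TBPTT training loop. This is a sketch under assumed toy settings (an `nn.RNN` with made-up sizes, random data, and a truncation length `S = 4`), not the implementation from the course.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S = 4                                    # truncation length (assumed value)
rnn = nn.RNN(input_size=3, hidden_size=5, batch_first=True)
head = nn.Linear(5, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

sequence = torch.randn(2, 12, 3)         # (batch, time, features), toy data
targets = torch.randn(2, 12, 1)

hidden = None
for start in range(0, sequence.shape[1], S):
    chunk = sequence[:, start:start + S]
    out, hidden = rnn(chunk, hidden)
    loss = nn.functional.mse_loss(head(out), targets[:, start:start + S])

    optimizer.zero_grad()
    loss.backward()                      # gradients flow through at most S timesteps
    optimizer.step()

    # Keep the hidden state's values but cut the graph, so the next chunk
    # still sees earlier context without backpropagating into it.
    hidden = hidden.detach()
```

The `detach()` call is what truncates the backward pass, while passing `hidden` into the next chunk is what lets information from more than `S` timesteps back still influence later outputs.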

Attention¶

  1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.

    1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
    1. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?

Answer:

A. Adding attention between the encoder and decoder makes their hidden states more likely to focus on the important parts of the sequence, because the decoder has the attention context added to its input. Also, unlike the model without attention, we no longer have to deal with hidden states losing context over long sequences, because the past hidden states of the encoder are saved.

B. The influence on the encoder is that all of its states are now saved, so each hidden state can represent a part of the sequence. The influence on the decoder is that no data is saved for it, so each of its hidden states has to incorporate the meaning of the whole sequence.

Unsupervised learning¶

  1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term. What would be the qualitative effect of this on:

    1. Images reconstructed by the model during training ($x\to z \to x'$)?
    2. Images generated by the model ($z \to x'$)?

Answer:

A. We use the KL divergence to make the approximate posterior $q(z|x)$ and the latent prior $p(z)$ as similar as possible. If we drop this term and use only the reconstruction loss, the reconstructed images may be very similar to the input images, since the reconstruction loss is then the only thing we minimize.

B. During generation, we use the probability distribution approximated during training. Because the KL divergence is not part of the loss, the distribution used to generate the images will be very different from the optimal distribution, so the generated images will most likely be of poor quality.

  1. Regarding VAEs, state whether each of the following statements is true or false, and explain:
    1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
    2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
    3. Since the real VAE loss term is intractable, what we actually minimize instead is it's upper bound, in the hope that the bound is tight.

Answer:

A. False. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\mat{\mu}_{\alpha}(\mat{x}), \mat{\Sigma}_{\alpha}(\mat{x}))$.

B. False. We will not get the same result, since the images are encoded and decoded using a probability distribution; the reconstructed images will therefore most likely differ from one run to the next.

C. True. Computing the VAE loss directly requires an intractable integral, because it requires knowing the evidence distribution. Instead, we use the evidence lower bound (ELBO): its negative is an upper bound on the true loss, and this bound is what we minimize.

  1. Regarding GANs, state whether each of the following statements is true or false, and explain:
    1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
    2. It's crucial to backpropagate into the generator when training the discriminator.
    3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
    4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that it's output isn't arbitrary.
      1. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

Answer:

A. False. A high discriminator loss means the discriminator is not good enough to be trained in parallel with the generator: if it discriminates worse than the generator generates, it gives the generator no incentive to improve.

B. False. We do not update one model's parameters while training the other. More specifically, while training the discriminator we keep the generator fixed, so backpropagating into it isn't necessary.

C. True. This is how we generate images. During training, the latent space is normally distributed, which lets us sample from it and map the sample to an image.

D. True. If the discriminator is trained on the dataset beforehand, it can already differentiate between relevant images and random ones. Instead of a coin-flip success rate in the first few epochs, we get a better assessment of our generator, which accelerates the training process.

E. False. When the discriminator reaches a stable state with fifty percent accuracy, it is no longer able to improve, and at this point it is merely guessing the authenticity of each image. Because the discriminator can no longer differentiate between generated and original images, the generator cannot improve any further using this discriminator, so further training is futile.
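Answers B and C can be illustrated with a single discriminator update. This is a sketch that uses tiny linear layers as assumed stand-ins for the real generator and discriminator; the sizes and the loss choice are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
generator = nn.Linear(8, 16)             # toy stand-ins for the real networks
discriminator = nn.Linear(16, 1)
loss_fn = nn.BCEWithLogitsLoss()

real = torch.randn(4, 16)                # placeholder "real" samples
z = torch.randn(4, 8)                    # latent vectors sampled from N(0, I)

# detach() blocks gradients from reaching the generator during this step.
fake = generator(z).detach()

d_loss = loss_fn(discriminator(real), torch.ones(4, 1)) \
    + loss_fn(discriminator(fake), torch.zeros(4, 1))
d_loss.backward()

generator_has_grads = any(p.grad is not None for p in generator.parameters())
discriminator_has_grads = all(p.grad is not None for p in discriminator.parameters())
print(generator_has_grads, discriminator_has_grads)
```

After the backward pass only the discriminator's parameters hold gradients, which matches the claim that the generator is kept fixed during the discriminator step.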

Graph Neural Networks¶

  1. You have implemented a graph convolutional layer based on the following formula, for a graph with $N$ nodes: $$ \mat{Y}=\varphi\left( \sum_{k=1}^{q} \mat{\Delta}^k \mat{X} \mat{\alpha}_k + \vec{b} \right). $$
    1. Assuming $\mat{X}$ is the input feature matrix of shape $(N, M)$: what does $\mat{Y}$ contain in it's rows?
    2. Unfortunately, due to a bug in your calculation of the Laplacian matrix, you accidentally zeroed the row and column $i=j=5$ (assume more than 5 nodes in the graph). What would be the effect of this bug on the output of your layer, $\mat{Y}$?

Answer:

A. Each row of Y contains the output feature map of the node in the corresponding row of X. The output feature map is the weighted sum $$ \mat{\Delta}^k \mat{x}^l $$

B. Zeroing the 5th row of the Laplacian matrix means that the 5th row of $\mat{\Delta}^k$ is zero as well, so the output features of the 5th node are always $\varphi\left( \vec{b} \right)$. This makes the output feature map of node 5 uninformative, since it no longer depends on the input. Also, zeroing the 5th column of the Laplacian means that the 5th node does not affect the features computed for the other nodes when averaging with the powers of the Laplacian: the other nodes' output features are calculated without any contribution from the fifth node.
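The effect of the bug can be checked numerically. The sketch below assumes a random symmetric matrix as a stand-in for the Laplacian, $\varphi=\tanh$, toy sizes, and index 5 for the zeroed row and column; the helper name `gcn_layer` is hypothetical.

```python
import torch

def gcn_layer(delta, X, alphas, b):
    """Y = phi(sum_k delta^k X alpha_k + b), with phi = tanh as an arbitrary choice."""
    pre_activation = b + sum(
        torch.linalg.matrix_power(delta, k) @ X @ alpha
        for k, alpha in enumerate(alphas, start=1)
    )
    return torch.tanh(pre_activation)

torch.manual_seed(0)
N, M, M_out, q = 8, 3, 2, 3
node = 5                                  # index of the zeroed row and column

A = torch.rand(N, N, dtype=torch.float64)
delta = (A + A.T) / 2                     # toy symmetric stand-in for the Laplacian
delta[node, :] = 0
delta[:, node] = 0

X = torch.rand(N, M, dtype=torch.float64)
alphas = [torch.rand(M, M_out, dtype=torch.float64) for _ in range(q)]
b = torch.rand(M_out, dtype=torch.float64)
Y = gcn_layer(delta, X, alphas, b)

# Changing the features of the bugged node should not change any output row.
X_perturbed = X.clone()
X_perturbed[node] += 10.0
Y_perturbed = gcn_layer(delta, X_perturbed, alphas, b)

print(torch.allclose(Y[node], torch.tanh(b)), torch.allclose(Y, Y_perturbed))
```

The first check corresponds to the zeroed row (the node's output collapses to $\varphi(\vec{b})$), and the second to the zeroed column (the node's features no longer reach any output).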

  1. We have discussed the notion of a Receptive Field in the context of a CNN. How would you define a similar concept in the context of a GCN (i.e. a model comprised of multiple graph convolutional layers)?

Answer:

While receptive fields in a CNN are based on the closeness of pixels in the image, receptive fields in a GCN are based on the geometric structure of the graph, i.e. we take into account the connections between nodes. The receptive field of every output feature is the set of nodes in previous layers that affected this feature (the k-ring of the node).
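The k-ring idea can be sketched as a breadth-first search over the graph. This assumes each layer aggregates only one-hop neighbours, so that `num_layers` stacked layers reach `num_layers` hops; the function name and the small path graph are illustrative.

```python
from collections import deque

def receptive_field(adjacency, node, num_layers):
    """Nodes within `num_layers` hops of `node` (assumes one-hop layers)."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == num_layers:
            continue  # do not expand beyond the allowed number of hops
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Path graph 0 - 1 - 2 - 3 - 4, given as an adjacency list.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(receptive_field(path, 0, 2))
```

On this path graph, two layers give node 0 a receptive field of nodes 0, 1 and 2; adding layers extends it one hop at a time, much like stacking convolutions enlarges the receptive field in a CNN.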